[ ]
, .loc
and .iloc
This is the beginning of a four-part series on how to select subsets of data from a pandas DataFrame or Series. Pandas offers a wide variety of options for subset selection which necessitates multiple articles. This series is broken down into the following topics.
[]
, .loc
and .iloc
If you want to be trusted to make decisions using pandas and scikit-learn, you must become an expert. I have completely mastered both libraries and have developed special techniques that will massively improve your ability and efficiency to do data analysis and machine learning.
These series of articles assume you have no knowledge of pandas, but that you understand the fundamentals of the Python programming language. It also assumes that you have installed pandas on your machine.
The easiest way to get pandas along with Python and the rest of the main scientific computing libraries is to install the Anaconda distribution.
If you have no knowledge of Python then I suggest reading Master the Fundamentals of Python first.
You might be wondering why there need to be so many articles on selecting subsets of data. This topic is extremely important to pandas and it’s unfortunate that it is fairly complicated because subset selection happens frequently during an actual analysis. Because you are frequently making subset selections, you need to master it in order to make your life with pandas easier.
The material in this article is also covered in the official pandas documentation on Indexing and Selecting Data. I highly recommend that you read that part of the documentation along with this tutorial. In fact, the documentation is one of the primary means for mastering pandas. I wrote a step-by-step article, How to Learn Pandas, which gives suggestions on how to use the documentation as you master pandas.
The pandas library has two primary containers of data, the DataFrame and the Series. You will spend nearly all your time working with both of the objects when you use pandas. The DataFrame is used more than the Series, so let’s take a look at an image of it first.
This image comes with some added illustrations to highlight its components. At first glance, the DataFrame looks like any other two-dimensional table of data that you have seen. It has rows and it has columns. Technically, there are three main components of the DataFrame.
A DataFrame is composed of three different components, the index, columns, and the data. The data is also known as the values.
The index represents the sequence of values on the far left-hand side of the DataFrame. All the values in the index are in bold font. Each individual value of the index is called a label. Sometimes the index is referred to as the row labels. In the example above, the row labels are not very interesting and are just the integers beginning from 0 up to n-1, where n is the number of rows in the table. Pandas defaults DataFrames with this simple index.
The columns are the sequence of values at the very top of the DataFrame. They are also in bold font. Each individual value of the columns is called a column, but can also be referred to as column name or column label.
Everything else not in bold font is the data or values. You will sometimes hear DataFrames referred to as tabular data. This is just another name for a rectangular table data with rows and columns.
It is also common terminology to refer to the rows or columns as an axis. Collectively, we call them axes. So, a row is an axis and a column is another axis.
The word axis appears as a parameter in many DataFrame methods. Pandas allows you to choose the direction of how the method will work with this parameter. This has nothing to do with subset selection so you can just ignore it for now.
The main takeaway from the DataFrame anatomy is that each row has a label and each column has a label. These labels are used to refer to specific rows or columns in the DataFrame. It’s the same as how humans use names to refer to specific people.
Master Data Analysis with Python is an extremely comprehensive course that will help you learn pandas to do data analysis.
I believe that it is the best possible resource available for learning how to data analysis with pandas and provide a 30-day 100% money back guarantee if you are not satisfied.
Before we start doing subset selection, it might be good to define what it is. Subset selection is simply selecting particular rows and columns of data from a DataFrame (or Series). This could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns.
Let’s see some images of subset selection. We will first look at a sample DataFrame with fake data.
Let’s say we want to select just the columns color
, age
, and height
but keep all the rows.
Our final DataFrame would look like this:
We can also make selections that select just some of the rows. Let’s select the rows with labels Aaron
and Dean
along with all of the columns:
Our final DataFrame would like:
Let’s combine the selections from above and select the columns color
, age
, and height
for only the rows with labels Aaron
and Dean
.
Our final DataFrame would look like this:
We already mentioned that each row and each column have a specific label that can be used to reference them. This is displayed in bold font in the DataFrame.
But, what hasn’t been mentioned, is that each row and column may be referenced by an integer as well. I call this integer location. The integer location begins at 0 and ends at n-1 for each row and column. Take a look above at our sample DataFrame one more time.
The rows with labels Aaron
and Dean
can also be referenced by their respective integer locations 2 and 4. Similarly, the columns color
, age
andheight
can be referenced by their integer locations 1, 3, and 4.
The documentation refers to integer location as position. I don’t particularly like this terminology as it's not as explicit as integer location. The key thing term here is INTEGER.
The documentation uses the term indexing frequently. This term is essentially just a one-word phrase to say ‘subset selection’. I prefer the term subset selection as, again, it is more descriptive of what is actually happening. Indexing is also the term used in the official Python documentation.
[]
, .loc
, and .
iloc
There are many ways to select subsets of data, but in this article, we will only cover the usage of the square brackets ([]
), .loc
and .iloc
. Collectively, they are called the indexers. These are by far the most common ways to select data. A different part of this Series will discuss a few methods that can be used to make subset selections.
If you have a DataFrame, df
, your subset selection will look something like the following:
df[ ]
df.loc[ ]
df.iloc[ ]
An actual subset selection will have something inside of the square brackets. All selections in this article will take place inside of those square brackets.
Notice that the square brackets also follow .loc
and .il
oc
. All indexing in Python happens inside of these square brackets.
The term indexing operator is used to refer to the square brackets following an object. The .loc
and .iloc
indexers also use the indexing operator to make selections. I will use the term just the indexing operator to refer todf[]
. This will distinguish it from df.loc[]
and df.iloc[]
.
read_csv
Let’s begin using pandas to read in a DataFrame, and from there, use the indexing operator by itself to select subsets of data. All the data for these tutorials are in the data directory.
We will use the read_csv
function to read in data into a DataFrame. We pass the path to the file as the first argument to the function. We will also use the index_col
parameter to select the first column of data as the index (more on this later).
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.read_csv('data/sample_data.csv', index_col=0)
>>> df
Earlier, we mentioned the three components of the DataFrame. The index, columns and data (values). We can extract each of these components into their own variables. Let’s do that and then inspect them:
>>> index = df.index
>>> columns = df.columns
>>> values = df.values
>>> indexIndex(['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina',
'Cornelia'], dtype='object')
>>> columnsIndex(['state', 'color', 'food', 'age', 'height', 'score'],
dtype='object')
>>> valuesarray
([['NY', 'blue', '
Steak
', 30, 165, 4.6],
['TX', 'green', 'Lamb', 2, 70, 8.3],
['FL', 'red', 'Mango', 12, 120, 9.0],
['AL', 'white', 'Apple', 4, 80, 3.3],
['AK', 'gray', 'Cheese', 32, 180, 1.8],
['TX', 'black', 'Melon', 33, 172, 9.5],
['TX', 'red', 'Beans', 69, 150, 2.2]], dtype=object)
Let’s output the type of each component to understand exactly what kind of object they are.
>>> type(index)pandas.core.indexes.base.Index
>>> type(columns)pandas.core.indexes.base.Index
>>> type(values)
numpy.ndarray
Interestingly, both the index and the columns are the same type. They are both a pandas Index
object. This object is quite powerful in itself, but for now, you can just think of it as a sequence of labels for either the rows or the columns.
The values are a NumPy ndarray
, which stands for n-dimensional array, and is the primary container of data in the NumPy library. Pandas is built directly on top of NumPy and it's this array that is responsible for the bulk of the workload.
We will begin our journey of selecting subsets by using just the indexing operator on a DataFrame. Its main purpose is to select a single column or multiple columns of data.
To select a single column of data, simply put the name of the column in-between the brackets. Let’s select the food column:
>>> df['food']
Jane Steak
Niko Lamb
Aaron Mango
Penelope Apple
Dean Cheese
Christina Melon
Cornelia Beans
Name: food, dtype: object
Selecting a single column of data returns the other pandas data container, the Series. A Series is a one-dimensional sequence of labeled data. There are two main components of a Series, the index and the data(or values). There are NO columns in a Series.
The visual display of a Series is just plain text, as opposed to the nicely styled table for DataFrames. The sequence of person names on the left is the index. The sequence of food items on the right is the values.
You will also notice two extra pieces of data on the bottom of the Series. The name of the Series becomes the old-column name. You will also see the data type or dtype
of the Series. You can ignore both these items for now.
It’s possible to select multiple columns with just the indexing operator by passing it a list of column names. Let’s select color
, food
, and score
:
>>> df[['color', 'food', 'score']]
Selecting multiple columns returns a DataFrame. You can actually select a single column as a DataFrame with a one-item list:
df[['food']]
Although this resembles the Series from above, it is technically a DataFrame, a different object.
When selecting multiple columns, you can select them in any order that you choose. It doesn’t have to be the same order as the original DataFrame. For instance, let’s select height
and color
.
df[['height', 'color']]
There are a couple of common exceptions that arise when doing selections with just the indexing operator.
KeyError
KeyError
>>> df['hight']KeyError: 'hight'
>>> df['color', 'age'] # should be: df[['color', 'age']]KeyError: ('color', 'age')
df['
col_name
']
df[['col_name1', 'col_name2']]
.loc
The .loc
indexer selects data in a different way than just the indexing operator. It can select subsets of rows or columns. It can also simultaneously select subsets of rows and columns. Most importantly, it only selects data by the LABEL of the rows and columns.
.loc
The .loc
indexer returns a single row as a Series when given a single row label. Let's select the row for Niko
.
>>> df.loc['Niko']state TX
color green
food Lamb
age 2
height 70
score 8.3
Name: Niko, dtype: object
We now have a Series, where the old column names are now the index labels. The name
of the Series has become the old index label, Niko
in this case.
.loc
To select multiple rows, put all the row labels you want to select in a list and pass that to .loc
. Let's select Niko
and Penelope
.
>>> df.loc[['Niko', 'Penelope']]
.loc
It is possible to ‘slice’ the rows of a DataFrame with .loc
by using slice notation. Slice notation uses a colon to separate start, stop and step values. For instance, we can select all the rows from Niko
through Dean
like this:
>>> df.loc['Niko':'Dean']
.loc
includes the last value with slice notationNotice that the row labeled with Dean
was kept. In other data containers such as Python lists, the last value is excluded.
You can use slice notation similarly to how you use it with lists. Let’s slice from the beginning through Aaron
:
>>> df.loc[:'Aaron']
Slice from Niko
to Christina
stepping by 2:
>>> df.loc['Niko':'Christina':2]
Slice from Dean
to the end:
>>> df.loc['Dean':]
.loc
Unlike just the indexing operator, it is possible to select rows and columns simultaneously with .loc
. You do it by separating your row and column selections by a comma. It will look something like this:
>>>
df.loc[row_selection, column_selection]
For instance, if we wanted to select the rows Dean
and Cornelia
along with the columns age
, state
and score
we would do this:
>>> df.loc[['Dean', 'Cornelia'], ['age', 'state', 'score']]
.loc
Row or column selections can be any of the following as we have already seen:
We can use any of these three for either row or column selections with .loc
. Let's see some examples.
Let’s select two rows and a single column:
>>> df.loc[['Dean', 'Aaron'], 'food']Dean Cheese
Aaron Mango
Name: food, dtype: object
Select a slice of rows and a list of columns:
>>> df.loc['Jane':'Penelope', ['state', 'color']]
Select a single row and a single column. This returns a scalar value.
>>> df.loc['Jane', 'age']30
Select a slice of rows and columns
>>> df.loc[:'Dean', 'height':]
It is possible to select all of the rows by using a single colon. You can then select columns as normal:
>>> df.loc[:, ['food', 'color']]
You can also use this notation to select all of the columns:
>>> df.loc[['Penelope','Cornelia'], :]
But, it isn’t necessary as we have seen, so you can leave out that last colon:
>>> df.loc[['Penelope','Cornelia']]
It might be easier to assign row and column selections to variables before you use .loc
. This is useful if you are selecting many rows or columns:
>>> rows = ['Jane', 'Niko', 'Dean', 'Penelope', 'Christina']
>>> cols = ['state', 'age', 'height', 'score']
>>> df.loc[rows, cols]
.loc
.iloc
The .iloc
indexer is very similar to .loc
but only uses integer locations to make its selections. The word .iloc
itself stands for integer location so that should help with remembering what it does.
.iloc
By passing a single integer to .iloc
, it will select one row as a Series:
>>> df.iloc[3]state AL
color white
food Apple
age 4
height 80
score 3.3
Name: Penelope, dtype: object
.iloc
Use a list of integers to select multiple rows:
>>> df.iloc[[5, 2, 4]] # remember, don't do df.iloc[5, 2, 4]
.iloc
Slice notation works just like a list in this instance and is exclusive of the last element
>>> df.iloc[3:5]
Select 3rd position until end:
>>> df.iloc[3:]
Select 3rd position to end by 2:
>>> df.iloc[3::2]
.iloc
Just like with .iloc
any combination of a single integer, lists of integers or slices can be used to select rows and columns simultaneously. Just remember to separate the selections with a comma.
Select two rows and two columns:
>>> df.iloc[[2,3], [0, 4]]
Select a slice of the rows and two columns:
>>> df.iloc[3:6, [1, 4]]
Select slices for both
>>> df.iloc[2:5, 2:5]
Select a single row and column
>>> df.iloc[0, 2]'Steak'
Select all the rows and a single column
>>> df.iloc[:, 5]Jane 4.6
Niko 8.3
Aaron 9.0
Penelope 3.3
Dean 1.8
Christina 9.5
Cornelia 2.2
Name: score, dtype: float64
.ix
Early in the development of pandas, there existed another indexer, ix
. This indexer was capable of selecting both by label and by integer location. While it was versatile, it caused lots of confusion because it's not explicit. Sometimes integers can also be labels for rows or columns. Thus there were instances where it was ambiguous.
You can still call .ix
, but it has been deprecated, so please never use it.
We can also, of course, do subset selection with a Series. Earlier I recommended using just the indexing operator for column selection on a DataFrame. Since Series do not have columns, I suggest using only .loc
and .iloc
. You can use just the indexing operator, but its ambiguous as it can take both labels and integers. I will come back to this at the end of the tutorial.
Typically, you will create a Series by selecting a single column from a DataFrame. Let’s select the food
column:
>>> food = df['food']
>>> foodJane Steak
Niko Lamb
Aaron Mango
Penelope Apple
Dean Cheese
Christina Melon
Cornelia Beans
Name: food, dtype: object
.loc
Series selection with .loc
is quite simple, since we are only dealing with a single dimension. You can again use a single row label, a list of row labels or a slice of row labels to make your selection. Let's see several examples.
Let’s select a single value:
>>> food.loc['Aaron']'Mango'
Select three different values. This returns a Series:
>>> food.loc[['Dean', 'Niko', 'Cornelia']]Dean Cheese
Niko Lamb
Cornelia Beans
Name: food, dtype: object
Slice from Niko
to Christina
- is inclusive of the last index
>>> food.loc['Niko':'Christina']Niko Lamb
Aaron Mango
Penelope Apple
Dean Cheese
Christina Melon
Name: food, dtype: object
Slice from Penelope
to the end:
>>> food.loc['Penelope':]Penelope Apple
Dean Cheese
Christina Melon
Cornelia Beans
Name: food, dtype: object
Select a single value in a list which returns a Series
>>> food.loc[['Aaron']]Aaron Mango
Name: food, dtype: object
.iloc
Series subset selection with .iloc
happens similarly to .loc
except it uses integer location. You can use a single integer, a list of integers or a slice of integers. Let's see some examples.
Select a single value:
>>> food.iloc[0]'Steak'
Use a list of integers to select multiple values:
>>> food.iloc[[4, 1, 3]]Dean Cheese
Niko Lamb
Penelope Apple
Name: food, dtype: object
Use a slice — is exclusive of the last integer
>>> food.iloc[4:6]Dean Cheese
Christina Melon
Name: food, dtype: object
It may be helpful to compare pandas ability to make selections by label and integer location to that of Python lists and dictionaries.
Python lists allow for the selection of data only through integer location. You can use a single integer or slice notation to make the selection but NOT a list of integers.
Let’s see examples of subset selection of lists using integers:
>>> some_list = ['a', 'two', 10, 4, 0, 'asdf', 'mgmt', 434, 99]
>>> some_list[5]'asdf'
>>> some_list[-1]99
>>> some_list[:4]['a', 'two', 10, 4]
>>> some_list[3:][4, 0, 'asdf', 'mgmt', 434, 99]
>>> some_list[2:6:3][10, 'asdf']
All values in each dictionary are labeled by a key. We use this key to make single selections. Dictionaries only allow selection with a single label. Slices and lists of labels are not allowed.
>>> d = {'a':1, 'b':2, 't':20, 'z':26, 'A':27}
>>> d['a']1
>>> d['A']27
DataFrames and Series are able to make selections with integers like a list and with labels like a dictionary.
There are a few more items that are important and belong in this tutorial and will be mentioned now.
Above, I used just the indexing operator to select a column or columns from a DataFrame. But, it can also be used to select rows using a slice. This behavior is very confusing in my opinion. The entire operation changes completely when a slice is passed.
Let’s use an integer slice as our first example:
>>> df[3:6]
To add to this confusion, you can slice by labels as well.
>>> df['Aaron':'Christina']
This feature is not deprecated and completely up to you whether you wish to use it. But, I highly prefer not to select rows in this manner as can be ambiguous, especially if you have integers in your index.
Using .iloc
and .loc
is explicit and clearly tells the person reading the code what is going to happen. Let's rewrite the above using .iloc
and .loc
.
>>> df.iloc[3:6] # More explicit that df[3:6]
>>> df.loc['Aaron':'Christina']
[]
An exception will be raised if you try and select rows and columns simultaneously with just the indexing operator. You must use .loc
or .iloc
to do so.
>>> df[3:6, 'Aaron':'Christina']TypeError: unhashable type: 'slice'
You can also use just the indexing operator with a Series. Again, this is confusing because it can accept integers or labels. Let’s see some examples
>>> foodJane Steak
Niko Lamb
Aaron Mango
Penelope Apple
Dean Cheese
Christina Melon
Cornelia Beans
Name: food, dtype: object
>>> food[2:4]Aaron Mango
Penelope Apple
Name: food, dtype: object
>>> food['Niko':'Dean']Niko Lamb
Aaron Mango
Penelope Apple
Dean Cheese
Name: food, dtype: object
Since Series don’t have columns you can use a single label and list of labels to make selections as well
>>> food['Dean']'Cheese'
>>> food[['Dean', 'Christina', 'Aaron']]Dean Cheese
Christina Melon
Aaron Mango
Name: food, dtype: object
Again, I recommend against doing this and always use .iloc
or .loc
We imported data by choosing the first column to be the index with the index_col
parameter of the read_csv
function. This is not typically how most DataFrames are read into pandas.
Usually, all the columns in the csv file become DataFrame columns. Pandas will use the integers 0 to n-1 as the labels. See the example data below with a slightly different dataset:
>>> df2 = pd.read_csv('data/sample_data2.csv')
>>> df2
RangeIndex
If you don’t specify a column to be the index when first reading in the data, pandas will use the integers 0 to n-1 as the index. This technically creates a RangeIndex
object. Let's take a look at it.
>>> df2.indexRangeIndex(start=0, stop=7, step=1)
This object is similar to Python range
objects. Let's create one:
>>> range(7)range(0, 7)
Converting both of these objects to a list produces the exact same thing:
>>> list(df2.index)[0, 1, 2, 3, 4, 5, 6]
>>> list(range(7))[0, 1, 2, 3, 4, 5, 6]
For now, it’s not at all important that you have a RangeIndex
. Selections from it happen just the same with .loc
and .iloc
. Let's look at some examples.
>>> df2.loc[[2, 4, 5], ['food', 'color']]
>>> df2.iloc[[2, 4, 5], [3,2]]
There is a subtle difference when using a slice. .iloc
excludes the last value, while .loc
includes it:
>>> df2.iloc[:3]
>>> df2.loc[:3]
It is common to see pandas code that reads in a DataFrame with a RangeIndex and then sets the index to be one of the columns. This is typically done with the s
et_index
method:
>>> df2_idx = df2.set_index('Names')
>>> df2_idx
Notice that this DataFrame does not look exactly like our first one from the very top of this tutorial. Directly above the index is the bold-faced wordNames
. This is technically the name of the index. Our original DataFrame had no name for its index. You can ignore this small detail for now. Subset selections will happen in the same fashion.
Pandas allows you to select a single column as a Series by using dot notation. This is also referred to as attribute access. You simply place the name of the column without quotes following a dot and the DataFrame like this:
>>> df.stateJane NY
Niko TX
Aaron FL
Penelope AL
Dean AK
Christina TX
Cornelia TX
Name: state, dtype: object
>>> df.ageJane 30
Niko 2
Aaron 12
Penelope 4
Dean 32
Christina 33
Cornelia 69
Name: age, dtype: int64
The best benefit of selecting columns like this is that you get help when chaining methods after selection. For instance, if you place another dot after the column name and press tab, a list of all the Series methods will appear in a pop-up menu. It will look like this:
This help disappears when you use just the indexing operator:
The biggest drawback is that you cannot select columns that have spaces or other characters that are not valid as Python identifiers (variable names).
This is rather peculiar, but you can actually select the same column more than once:
df[['age', 'age', 'age']]
We covered an incredible amount of ground. Let’s summarize all the main points:
[]
, .loc
and .iloc
[]
immediately following a DataFrame/Series.loc
makes selections only by label.loc
can simultaneously select rows and columns.loc
can make selections with either a single label, a list of labels, or a slice of labels.loc
makes row selections first followed by column selections:df.loc[row_selection, col_selection]
.iloc
is analogous to .loc
but uses only integer location to refer to rows or columns..ix
is deprecated and should never be used.loc
and .iloc
work the same for Series except they only select based on the index as there are no columns.loc
and .iloc
set_index
method to use a column as an index.This is only part 1 of the series, so there is much more to cover on how to select subsets of data in pandas. Some of the explanations in this part will be expanded to include other possibilities.
Immerse yourself in my comprehensive path for mastering data science and machine learning with Python. Purchase the All Access Pass to get lifetime access to all current and future courses. Some of the courses it contains:
Upon registration, you'll get access to the following free courses: