This article summarizes the very detailed guide presented in Minimally Sufficient Pandas.
Use brackets, not dot notation, to select a single column of data, because dot notation cannot handle column names that contain spaces, names that collide with DataFrame methods, or a column name held in a variable.
>>> df['colname'] # do this
>>> df.colname # not that
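The three failure cases for dot notation can be sketched as follows. The DataFrame and its column names are hypothetical, made up purely for illustration.

```python
import pandas as pd

# Hypothetical data: one name contains a space, one collides with a DataFrame method.
df = pd.DataFrame({"first name": ["Ann", "Bob"], "count": [3, 5], "age": [31, 42]})

with_space = df["first name"]   # dot notation cannot express a name with a space
collides = df["count"]          # df.count is the DataFrame.count method, not the column

col = "age"
by_variable = df[col]           # df.col would look for a column literally named "col"

dot_is_method = callable(df.count)  # the attribute resolves to the method
```

Brackets work in every one of these cases, so there is only one syntax to remember.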
ix indexer
The ix indexer is ambiguous and confusing because it allows selection by both label and integer location. It was deprecated and then removed in pandas 1.0. Every trace of ix should be removed and replaced with the explicit loc or iloc indexers.
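A minimal sketch of the explicit indexers, using a hypothetical DataFrame whose integer index labels differ from the row positions, which is exactly the situation where a mixed label/position indexer becomes ambiguous.

```python
import pandas as pd

# Hypothetical data: index labels are integers that do not match row positions.
df = pd.DataFrame({"score": [90, 80, 70]}, index=[2, 1, 0])

by_label = df.loc[0, "score"]   # the row whose index label is 0
by_position = df.iloc[0, 0]     # the first row and first column by position
```

Here loc returns 70 and iloc returns 90, and the code states which meaning is intended.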
The at and iat indexers give a small increase in performance when selecting a single DataFrame cell. If your application depends on fast single-cell selection, use NumPy arrays rather than at or iat.
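As a rough illustration (not a benchmark), the sketch below selects the same cell four ways on a small made-up DataFrame. Converting to a NumPy array once and indexing the array is the approach suggested for performance-critical code.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

cell_loc = df.loc[1, "b"]   # general-purpose label selection
cell_at = df.at[1, "b"]     # scalar-only label accessor
cell_iat = df.iat[1, 1]     # scalar-only positional accessor

arr = df.to_numpy()         # convert once, then index the array directly
cell_np = arr[1, 1]
```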
read_csv vs read_table
The only difference between these two functions is the default delimiter. Use read_csv for all cases as read_table is deprecated.
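Because only the default delimiter differs, tab-delimited data can be read with read_csv by passing sep explicitly. The sketch below uses an in-memory string with invented contents in place of a real file.

```python
import io

import pandas as pd

# Hypothetical tab-delimited text standing in for a file on disk.
tab_text = "name\tscore\nAnn\t90\nBob\t80\n"

df = pd.read_csv(io.StringIO(tab_text), sep="\t")
```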
isna vs isnull and notna vs notnull
isna is an alias of isnull and notna is an alias of notnull. Use isna and notna as they end with ‘na’ like the other missing value methods fillna and dropna.
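A short sketch of the four 'na' methods used together on a made-up Series, including a check that isna and isnull give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

missing = s.isna()
present = s.notna()
same_as_isnull = missing.equals(s.isnull())  # the alias returns the same result

filled = s.fillna(0)
dropped = s.dropna()
```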
Use the operators (+, *, >, <=, etc.) and not their corresponding methods (add, mul, gt, le, etc.) in all cases except when absolutely necessary, such as when you need to change the direction of the alignment.
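One case where the method form is needed is subtracting a per-row value from every column. The sketch below assumes a small hypothetical DataFrame. The operator would align the Series index with the column labels, so the method's axis argument is used to align along the rows instead.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"a": [1, 2], "b": [3, 6]})

doubled = df * 2               # the operator covers the everyday case

row_means = df.mean(axis=1)    # one value per row
# df - row_means would match the Series index against the column labels,
# which is not what is wanted here; sub with axis="index" aligns on rows.
centered = df.sub(row_means, axis="index")
```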
Use the Pandas method over any built-in Python function with the same name.
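One reason to prefer the pandas method, sketched here with a made-up Series, is that it handles missing values, while the built-in function knows nothing about them. This is offered as an illustration rather than the only motivation.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

pandas_total = s.sum()    # skips missing values by default
builtin_total = sum(s)    # plain Python addition, so NaN propagates
```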
groupby aggregation
There are a few different syntaxes available to do a groupby aggregation. Use df.groupby('grouping column').agg({'aggregating column': 'aggregating function'}) as it can handle more complex cases.
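A minimal sketch of the dictionary syntax on invented sales data. The same form extends from one aggregating column to several without changing shape.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "state": ["TX", "TX", "CA", "CA"],
    "sales": [10, 20, 5, 15],
    "units": [1, 2, 3, 4],
})

single = df.groupby("state").agg({"sales": "sum"})
several = df.groupby("state").agg({"sales": "sum", "units": "max"})
```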
A DataFrame with a MultiIndex offers little benefit over one with a single-level index. I advise against using them. Instead, flatten them after a call to groupby by renaming columns and resetting the index.
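The flattening step might look like the sketch below. The data and the underscore-joined column names are assumptions chosen for illustration, and other naming schemes work just as well.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "state": ["TX", "TX", "CA", "CA"],
    "year": [2019, 2020, 2019, 2020],
    "sales": [10, 20, 5, 15],
})

# Grouping by two columns with two aggregations gives a MultiIndex
# on both the rows and the columns.
grouped = df.groupby(["state", "year"]).agg({"sales": ["min", "max"]})

flat = grouped.copy()
flat.columns = ["_".join(pair) for pair in grouped.columns]  # ("sales", "min") -> "sales_min"
flat = flat.reset_index()  # move state and year back into ordinary columns
```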
groupby aggregation and pivot_table
A groupby aggregation and a pivot_table produce exactly the same data in a different shape. Use groupby when you want to continue an analysis and pivot_table when you want to compare groups.
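The sketch below, on invented data, computes the same sums both ways: the groupby result is long (one row per group), while the pivot_table result is wide (one column per year), which makes side-by-side comparison easier.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "state": ["TX", "TX", "CA", "CA"],
    "year": [2019, 2020, 2019, 2020],
    "sales": [10, 20, 5, 15],
})

long_form = df.groupby(["state", "year"]).agg({"sales": "sum"}).reset_index()
wide_form = df.pivot_table(index="state", columns="year", values="sales", aggfunc="sum")
```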
The pivot_table method and the crosstab function are very similar. Only use crosstab when finding the relative frequency.
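As a sketch of that division of labor, the example below counts with pivot_table and computes row-wise relative frequencies with crosstab and its normalize parameter. The data, including the emp_id column used for counting, is hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "emp_id": [1, 2, 3, 4],
    "dept": ["A", "A", "B", "B"],
    "level": ["jr", "sr", "jr", "jr"],
})

# Raw counts: pivot_table is sufficient.
counts = df.pivot_table(index="dept", columns="level", values="emp_id",
                        aggfunc="count", fill_value=0)

# Relative frequency within each department: crosstab with normalize.
rel_freq = pd.crosstab(df["dept"], df["level"], normalize="index")
```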
The pivot method pivots data without aggregating. It is possible to duplicate its functionality with pivot_table by selecting an aggregation function. Consider using only pivot_table and not pivot.
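When each index/column pair occurs only once, any aggregation that returns the single value unchanged reproduces pivot. The sketch below uses max on invented data. The choice of max is an assumption, and min or sum would give the same values here.

```python
import pandas as pd

# Hypothetical data: each (state, year) pair appears exactly once.
df = pd.DataFrame({
    "state": ["TX", "TX", "CA", "CA"],
    "year": [2019, 2020, 2019, 2020],
    "sales": [10, 20, 5, 15],
})

with_pivot = df.pivot(index="state", columns="year", values="sales")
with_pivot_table = df.pivot_table(index="state", columns="year",
                                  values="sales", aggfunc="max")

same_values = bool((with_pivot.to_numpy() == with_pivot_table.to_numpy()).all())
```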
Both the melt and stack methods reshape the data in a very similar manner. Use melt over stack because it allows you to rename columns and it avoids a MultiIndex.
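A small sketch of the difference on a made-up wide table: melt names the new columns directly and returns a flat DataFrame, while stack returns a Series with a MultiIndex.

```python
import pandas as pd

# Hypothetical wide-format data for illustration only.
wide = pd.DataFrame({
    "state": ["TX", "CA"],
    "2019": [10, 5],
    "2020": [20, 15],
})

# melt: new column names are set in the same call, and the index stays flat.
melted = wide.melt(id_vars="state", var_name="year", value_name="sales")

# stack: the result is indexed by a (state, year) MultiIndex.
stacked = wide.set_index("state").stack()
```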
pivot and unstack
Both pivot and unstack reshape data similarly, but as noted above, pivot_table can handle all cases that pivot can, so I suggest using it over both of the others.
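For comparison, the sketch below reshapes the same invented long-format data with unstack and with pivot_table. Using max as the aggregation is an assumption that is safe here only because each group holds a single value.

```python
import pandas as pd

# Hypothetical long-format data: each (state, year) pair appears exactly once.
long_df = pd.DataFrame({
    "state": ["TX", "TX", "CA", "CA"],
    "year": [2019, 2020, 2019, 2020],
    "sales": [10, 20, 5, 15],
})

# unstack needs the keys moved into the index first.
via_unstack = long_df.set_index(["state", "year"])["sales"].unstack()

# pivot_table works directly on the columns.
via_pivot_table = long_df.pivot_table(index="state", columns="year",
                                      values="sales", aggfunc="max")

same_values = bool((via_unstack.to_numpy() == via_pivot_table.to_numpy()).all())
```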
The examples above cover the most common areas of Pandas where users have multiple options. There are many other attributes and methods that are not discussed. Below, I provide a categorized list of the minimum number of DataFrame attributes and methods that can accomplish nearly all of your data analysis tasks. It reduces the number from over 240 to fewer than 80.
These result in a single value for each column
Missing Value Handling