In this article, I will present an ‘optimal’ solution to Dunder Data Challenge #3. Please refer to that article for the problem setup. Work on this challenge directly in a Jupyter Notebook right now by clicking this link.
The naive solution was presented in detail in the previous article. The end result was a massive custom function containing many boolean filters used to find specific subsets of data to aggregate. For each group, a Series was returned with 11 values. Each of these values became a new column in the resulting DataFrame. Let’s take a look at the custom function:
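The full f_final function is embedded in that article and in the notebook. As a rough sketch only (the filter and column names below are illustrative, not the challenge's actual definitions), it follows this pattern:

def f_final_sketch(x):
    # every boolean filter is recomputed for each group
    is_2019H1 = x['date'].between('2019-01-01', '2019-06-30')
    is_large = x['revenue'] > 1000                  # illustrative filter
    # ... many more filters ...
    return pd.Series({'rev_2019': x.loc[is_2019H1, 'revenue'].sum(),
                      'large_sales': is_large.sum(),
                      # ... 11 values in total ...
                      })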
Running this naive solution takes nearly 4 seconds.
To greatly increase performance, we need to take advantage of the built-in methods available to groupby objects. Above, we used a custom function to perform many calculations, and those calculations were repeated for every group. Let’s get the total number of groups.
>>> len(df.groupby(['country', 'region']))
520
For each of these 520 groups, the massive f_final function was called, recomputing each filter. Running these same calculations for every group is one of the main causes of poor performance with apply.
Take a look at all the filters in f_final. You'll notice that each of them is independent of the particular group. This means that we can calculate these filters before grouping and get the same result. As a concrete example, take a look at the following DataFrame.
import pandas as pd

df_rev = pd.DataFrame({'state': ['TX', 'TX', 'TX', 'CA', 'CA', 'CA'],
                       'category': ['tech', 'energy', 'energy',
                                    'tech', 'energy', 'energy'],
                       'revenue': [10, 5, 8, 20, 12, 2]})
Let’s calculate the revenue for each state, but just for the energy category. A naive solution involves writing a custom function, where each group will be filtered for just the energy category and then have the revenue summed.
def f(x):
    is_energy = x['category'] == 'energy'
    return x.loc[is_energy, 'revenue'].sum()
Using this custom function with apply returns the correct revenue for each state’s energy category.
>>> df_rev.groupby('state').apply(f)
state
CA 14
TX 13
Instead, we can create an entirely new column holding just the energy revenue. First, we create a boolean Series where True corresponds to 'energy'. We then multiply this Series by the original revenue column. Because False evaluates to 0 and True evaluates to 1, the new column will be just like the original, but with 0 everywhere the category is not energy.
filt = df_rev['category'] == 'energy'
df_rev['energy_revenue'] = filt * df_rev['revenue']
df_rev
We can now use the built-in sum method instead of our custom function during the grouping. There is no need for apply here.
>>> df_rev.groupby('state')['energy_revenue'].sum()
state
CA 14
TX 13
The where method
Instead of replacing the filtered values with 0, as we did above, you might need to make them missing. This is crucial if you are calculating something like the mean or median, which would otherwise take the 0 values into account. The where method keeps the values where the passed boolean Series is True and replaces the rest with NaN.
filt = df_rev['category'] == 'energy'
df_rev['energy_revenue'] = df_rev['revenue'].where(filt)
df_rev
We can call the same groupby to get the same result.
>>> df_rev.groupby('state')['energy_revenue'].sum()
state
CA 14.0
TX 13.0
Let’s calculate rev_2019, which is defined as the revenue during the first half of 2019. Let's take the naive approach first by defining a custom function.
>>> def get_rev_2019(x):
...     is_2019H1 = x['date'].between('2019-01-01', '2019-06-30')
...     return x.loc[is_2019H1, 'revenue'].sum()

>>> df.groupby(['country', 'region']).apply(get_rev_2019).head()
country region
Argentina A 150508
B 139048
C 118035
D 131728
E 146201
Now, let’s use our new method of applying the filter to the entire DataFrame first, creating a new column, and then using the built-in sum method.
>>> is_2019H1 = df['date'].between('2019-01-01', '2019-06-30')
>>> df['rev_2019'] = df['revenue'].where(is_2019H1)
>>> df.groupby(['country', 'region'])['rev_2019'].sum().head()
country region
Argentina A 150508.0
B 139048.0
C 118035.0
D 131728.0
E 146201.0
Let’s compare the performance between the two methods.
>>> %timeit df.groupby(['country', 'region']).apply(get_rev_2019)
633 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %%timeit
... is_2019H1 = df['date'].between('2019-01-01', '2019-06-30')
... df['rev_2019'] = df['revenue'].where(is_2019H1)
... df.groupby(['country', 'region'])['rev_2019'].sum()
27.3 ms ± 530 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The custom function is roughly 23x slower than the built-in method (633 ms vs. 27.3 ms), and this is just a simple calculation. The more complex the custom function, the larger the performance difference becomes.
If you are attempting to avoid apply, then you have no choice but to use the built-in groupby methods. This limits the possibilities and forces you to approach the problem differently. The main ‘trick’ is to execute operations on the entire DataFrame before grouping.
Not all operations can be executed on the entire DataFrame, only those that are independent of the group. So how do you know whether an operation is independent of the group? It must not contain a calculation that is specific to the current group. For instance, we are grouping by country and region. If an operation depends on the particular country or region, then it cannot be executed on the entire DataFrame.
A concrete example can help here: if the definition of the first half of the year were January through July for Greece and January through June for all other countries, then the calculation of revenue for the first half of the year would depend on the group.
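To make this concrete, here is a hypothetical variant of get_rev_2019 in which the date cutoff depends on the country of the current group. A filter like this is specific to the group, so it belongs inside the function passed to apply:

def get_rev_h1(x):
    # x.name holds the (country, region) key of the current group
    end = '2019-07-31' if x.name[0] == 'Greece' else '2019-06-30'
    is_h1 = x['date'].between('2019-01-01', end)
    return x.loc[is_h1, 'revenue'].sum()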
In this challenge, all of the operations are independent of the group; there are no special cases based on the group. This means that everything we did inside the custom function passed to apply can be executed on the entire DataFrame before we group.
The complete optimal solution will now be given. We will use the same definitions for our filters as we did in the custom function, but calculate them on the entire DataFrame. We will then create new columns that have NaN wherever the filter is False.
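The actual column-creation code is in the notebook. As a minimal sketch of the pattern, reusing rev_2019 from above (the comment stands in for the remaining columns):

is_2019H1 = df['date'].between('2019-01-01', '2019-06-30')
df['rev_2019'] = df['revenue'].where(is_2019H1)
# ... repeat for each remaining filter/column used by f_final ...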
We can now use just the built-in groupby methods to aggregate the data.
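A sketch of that aggregation, assuming the columns created above (the real solution aggregates all eleven values, and the variable names here are illustrative):

df_new = df.groupby(['country', 'region']).agg({'rev_2019': 'sum',
                                                'revenue': 'sum'})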
This is not the final DataFrame, as some columns can only be calculated from the result of the aggregated values. We also need to drop some of these intermediate columns that are no longer desired in the result.
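For example (the derived column here is hypothetical, not one of the challenge's actual outputs), a share of total revenue can only be computed after aggregating, and the helper total is then dropped:

df_new['rev_2019_pct'] = df_new['rev_2019'] / df_new['revenue']  # derived from aggregated values
df_new = df_new.drop(columns=['revenue'])  # intermediate column no longer needed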
Let’s verify that the DataFrames are equivalent.
>>> df1 = df.groupby(['country', 'region']).apply(f_final)
>>> df1.equals(df_new1.astype('float'))
True
Let’s put all the steps of the optimal solution into a single function, which we can then use to measure performance.
def optimal():
    ...  # all steps of the optimal solution shown above
The optimal solution is about 20x as fast as the naive solution due to pre-calculating new columns on the entire DataFrame and only using built-in groupby methods.
>>> %timeit optimal()
201 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I suggest avoiding apply at all costs, as it can badly hurt performance. The following steps summarize the procedure for avoiding apply:
1. Define every filter that is independent of the group on the entire DataFrame.
2. Use these filters to create new columns, replacing the filtered-out values with 0 or NaN (using where) as appropriate.
3. Aggregate using only the built-in groupby methods.
4. Compute any columns that depend on the aggregated values, then drop the intermediate columns that are not part of the final result.