In this article, I will present an ‘optimal’ solution to Dunder Data Challenge #3. Please refer to that article for the problem setup. Work on this challenge directly in a Jupyter Notebook right now by clicking this link.
The naive solution was presented in detail in the previous article. The end result was a massive custom function containing many boolean filters used to find specific subsets of data to aggregate. For each group, a Series was returned with 11 values. Each of these values became a new column in the resulting DataFrame. Let’s take a look at the custom function:
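The full f_final function is embedded in that article and in the notebook. As a rough sketch only (the filter and column names below are illustrative, not the challenge's actual definitions), it follows this pattern:

def f_final_sketch(x):
    # every boolean filter is recomputed for each group
    is_2019H1 = x['date'].between('2019-01-01', '2019-06-30')
    is_large = x['revenue'] > 1000                  # illustrative filter
    # ... many more filters ...
    return pd.Series({'rev_2019': x.loc[is_2019H1, 'revenue'].sum(),
                      'large_sales': is_large.sum(),
                      # ... 11 values in total ...
                      })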
Running this naive solution takes nearly 4 seconds.
To greatly increase performance, we need to take advantage of the built-in methods available to groupby objects. Above, we used a custom function to perform many calculations, and those calculations were repeated for every group. Let’s get the total number of groups.
>>> len(df.groupby(['country', 'region']))
520
For each of these 520 groups, the massive f_final function was called, recomputing each filter. Running these same calculations for every group is one of the main causes of poor performance with apply.
Take a look at all the filters in f_final. You'll notice that each of them is independent of the particular group. This means that we can calculate these filters before grouping and get the same result. As a concrete example, take a look at the following DataFrame.
import pandas as pd

df_rev = pd.DataFrame({'state': ['TX', 'TX', 'TX', 'CA', 'CA', 'CA'],
                       'category': ['tech', 'energy', 'energy',
                                    'tech', 'energy', 'energy'],
                       'revenue': [10, 5, 8, 20, 12, 2]})
Let’s calculate the revenue for each state, but just for the energy category. A naive solution involves writing a custom function, where each group will be filtered for just the energy category and then have the revenue summed.
def f(x):
    is_energy = x['category'] == 'energy'
    return x.loc[is_energy, 'revenue'].sum()
Using this custom function with apply returns the correct revenue for each state’s energy category.
>>> df_rev.groupby('state').apply(f)
state
CA 14
TX 13
Instead, we can create an entirely new column holding just the energy revenue. First, we create a boolean Series where True corresponds to 'energy'. We then multiply this Series by the original revenue column. Because False evaluates to 0 and True evaluates to 1, the new column will be just like the original, but with 0 everywhere the category is not energy.
filt = df_rev['category'] == 'energy'
df_rev['energy_revenue'] = filt * df_rev['revenue']
df_rev
We can now use the built-in sum method instead of our custom function during the grouping. There is no need for apply here.
>>> df_rev.groupby('state')['energy_revenue'].sum()
state
CA 14
TX 13
The where method
Instead of replacing the filtered values with 0, as we did above, you might need to make them missing. This is crucial if you are calculating something like the mean or median, which would otherwise take the 0 values into account. The where method keeps the values where the passed boolean Series is True and replaces the rest with NaN.
filt = df_rev['category'] == 'energy'
df_rev['energy_revenue'] = df_rev['revenue'].where(filt)
df_rev
We can call the same groupby to get the same result.
>>> df_rev.groupby('state')['energy_revenue'].sum()
state
CA 14.0
TX 13.0
Let’s calculate rev_2019, which is defined as the revenue during the first half of 2019. Let's take the naive approach first by defining a custom function.
>>> def get_rev_2019(x):
...     is_2019H1 = x['date'].between('2019-01-01', '2019-06-30')
...     return x.loc[is_2019H1, 'revenue'].sum()

>>> df.groupby(['country', 'region']).apply(get_rev_2019).head()
country region
Argentina A 150508
B 139048
C 118035
D 131728
E 146201
Now, let’s use our new method of applying the filter to the entire DataFrame first, creating a new column, and then using the built-in sum method.
>>> is_2019H1 = df['date'].between('2019-01-01', '2019-06-30')
>>> df['rev_2019'] = df['revenue'].where(is_2019H1)
>>> df.groupby(['country', 'region'])['rev_2019'].sum().head()
country region
Argentina A 150508.0
B 139048.0
C 118035.0
D 131728.0
E 146201.0
Let’s compare the performance between the two methods.
>>> %timeit df.groupby(['country', 'region']).apply(get_rev_2019)
633 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %%timeit
... is_2019H1 = df['date'].between('2019-01-01', '2019-06-30')
... df['rev_2019'] = df['revenue'].where(is_2019H1)
... df.groupby(['country', 'region'])['rev_2019'].sum()
27.3 ms ± 530 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The custom function is roughly 23x slower than the built-in method (633 ms vs. 27.3 ms), and this is just a simple calculation. The more complex the custom function, the larger the performance difference becomes.
If you are attempting to avoid apply, then you have no choice but to use the built-in groupby methods. This limits the possibilities and forces you to approach the problem differently. The main ‘trick’ is to execute operations on the entire DataFrame before grouping.
Not all operations can be executed on the entire DataFrame, only those that are independent of the group. So how do you know whether an operation is independent of the group? It must not contain a calculation that is specific to the current group. For instance, we are grouping by country and region. If an operation depends on the particular country or region, then it cannot be executed on the entire DataFrame.
A concrete example can help here: if the definition of the first half of the year were January through July for Greece and January through June for all other countries, then the calculation of revenue for the first half of the year would depend on the group.
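To make this concrete, here is a hypothetical variant of get_rev_2019 in which the date cutoff depends on the country of the current group. A filter like this is specific to the group, so it belongs inside the function passed to apply:

def get_rev_h1(x):
    # x.name holds the (country, region) key of the current group
    end = '2019-07-31' if x.name[0] == 'Greece' else '2019-06-30'
    is_h1 = x['date'].between('2019-01-01', end)
    return x.loc[is_h1, 'revenue'].sum()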
In this challenge, all of the operations are independent of the group; there are no special cases based on the group. This means that everything we did inside the custom function passed to apply can be executed on the entire DataFrame before we group.
The complete optimal solution will now be given. We will use the same definitions for our filters as we did in the custom function, but calculate them on the entire DataFrame. We will then create new columns that have NaN wherever the filter is False.
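The actual column-creation code is in the notebook. As a minimal sketch of the pattern, reusing rev_2019 from above (the comment stands in for the remaining columns):

is_2019H1 = df['date'].between('2019-01-01', '2019-06-30')
df['rev_2019'] = df['revenue'].where(is_2019H1)
# ... repeat for each remaining filter/column used by f_final ...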
We can now use just the built-in groupby methods to aggregate the data.
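A sketch of that aggregation, assuming the columns created above (the real solution aggregates all eleven values, and the variable names here are illustrative):

df_new = df.groupby(['country', 'region']).agg({'rev_2019': 'sum',
                                                'revenue': 'sum'})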
This is not the final DataFrame, as some columns can only be calculated from the result of the aggregated values. We also need to drop some of these intermediate columns that are no longer desired in the result.
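For example (the derived column here is hypothetical, not one of the challenge's actual outputs), a share of total revenue can only be computed after aggregating, and the helper total is then dropped:

df_new['rev_2019_pct'] = df_new['rev_2019'] / df_new['revenue']  # derived from aggregated values
df_new = df_new.drop(columns=['revenue'])  # intermediate column no longer needed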
Let’s verify that the DataFrames are equivalent.
>>> df1 = df.groupby(['country', 'region']).apply(f_final)
>>> df1.equals(df_new1.astype('float'))
True
Let’s put all the steps of the optimal solution into a single function, which we can then use to measure performance.
def optimal():
    ...  # all steps of the optimal solution shown above
The optimal solution is about 20x as fast as the naive solution due to pre-calculating new columns on the entire DataFrame and only using built-in groupby methods.
>>> %timeit optimal()
201 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I suggest avoiding apply at all costs, as it can badly hurt performance. The following steps summarize the procedure for avoiding apply:
1. Define every filter that is independent of the group on the entire DataFrame.
2. Use these filters to create new columns, replacing the filtered-out values with 0 or NaN (using where) as appropriate.
3. Aggregate using only the built-in groupby methods.
4. Compute any columns that depend on the aggregated values, then drop the intermediate columns that are not part of the final result.