In this tutorial, you'll learn how to create a bar chart race animation such as the one below using the matplotlib data visualization library in Python.
bar_chart_race python package
Along with this tutorial is the release of the Python package bar_chart_race, which automates the process of making these animations. This post explains the procedure from scratch.
A bar chart race is an animated sequence of bars that show data values at different moments in time. The bars re-position themselves at each time period so that they remain in order (either ascending or descending).
The trick to making a bar chart race is to transition the bars slowly to their new position when their order changes, allowing you to easily track the movements.
For this bar chart race, we'll use a small dataset produced by Johns Hopkins University containing the total deaths by date for six countries during the ongoing coronavirus pandemic. Let's read it in now.
import pandas as pd
df = pd.read_csv('data/covid19.csv', index_col='date', parse_dates=['date'])
df.tail()
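For this tutorial, the data must be in 'wide' form, where each row represents a single time period (here, a date in the index) and each column holds the values for one category (here, a country).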
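To make the 'wide' layout concrete, here is a minimal sketch that pivots a small long-form table into wide form. The column names (date, country, deaths) and all of the numbers are made up for illustration and are not taken from the dataset above.

```python
import pandas as pd

# Hypothetical long-form records: one row per (date, country) pair.
long_df = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-29', '2020-03-29',
                            '2020-03-30', '2020-03-30']),
    'country': ['Italy', 'Spain', 'Italy', 'Spain'],
    'deaths': [100, 80, 120, 95],
})

# Wide form: one row per date, one column per country.
wide_df = long_df.pivot(index='date', columns='country', values='deaths')
print(wide_df)
```

If your own data arrives in long form, a pivot like this is one way to get it into the shape the rest of the tutorial expects.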
Let's begin by creating a single static bar chart for the specific date of March 29, 2020. First, we select the data as a Series.
s = df.loc['2020-03-29']
s
We'll make a horizontal bar chart with matplotlib using the country names as the y-values and total deaths as the x-values (width of bars). Every bar will be a different color from the 'Dark2' colormap.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4, 2.5), dpi=144)
colors = plt.cm.Dark2(range(6))
y = s.index
width = s.values
ax.barh(y=y, width=width, color=colors);
The function below changes several properties of the axes to make it look nicer.
def nice_axes(ax):
    ax.set_facecolor('.8')
    ax.tick_params(labelsize=8, length=0)
    ax.grid(True, axis='x', color='white')
    ax.set_axisbelow(True)
    # Hide all four spines (the box around the plot)
    for spine in ax.spines.values():
        spine.set_visible(False)
nice_axes(ax)
fig
For a bar chart race, the bars are often ordered from largest to smallest with the largest at the top. Here, we plot three days of data, sorting each day's values first.
fig, ax_array = plt.subplots(nrows=1, ncols=3, figsize=(7, 2.5),
                             dpi=144, tight_layout=True)
dates = ['2020-03-29', '2020-03-30', '2020-03-31']
for ax, date in zip(ax_array, dates):
    s = df.loc[date].sort_values()
    ax.barh(y=s.index, width=s.values, color=colors)
    ax.set_title(date, fontsize='smaller')
    nice_axes(ax)
Although the bars are ordered properly, the countries do not keep their original color when changing places in the graph. Notice that the USA begins as the fifth bar and moves up one position each date, changing colors each time.
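The color change happens because colors are assigned by position, and sorting changes which country sits in each position. The sketch below reproduces the effect without any plotting; the category labels, values, and color names are made up for illustration.

```python
import pandas as pd

# Colors are handed out by position, just like the `colors` array above.
palette = ['green', 'orange', 'purple']

# Two hypothetical days; C overtakes the others on the second day.
day1 = pd.Series({'A': 30, 'B': 20, 'C': 10})
day2 = pd.Series({'A': 30, 'B': 20, 'C': 40})

mappings = []
for day in (day1, day2):
    s = day.sort_values()                     # sorting reorders the labels
    mappings.append(dict(zip(s.index, palette)))

print(mappings[0])  # A is paired with 'purple'
print(mappings[1])  # A is now paired with 'orange'
```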
Instead of sorting, use the rank method to find the numeric ranking of each country for each day. We use the 'first' method of ranking so that each numeric rank is a unique integer. By default, the method is 'average', which gives tied values the same rank and causes overlapping bars. Let's see the ranking for March 29, 2020.
df.loc['2020-03-29'].rank(method='first')
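The two ranking methods only differ when values tie. The sketch below compares them on a made-up Series containing one tie; the labels and numbers are hypothetical.

```python
import pandas as pd

# A and C are tied at 10.
s = pd.Series({'A': 10, 'B': 25, 'C': 10, 'D': 40})

avg = s.rank()                  # default method='average': ties share a rank
first = s.rank(method='first')  # ties are broken by order of appearance

print(pd.DataFrame({'value': s, 'average': avg, 'first': first}))
```

With 'average', A and C would both be drawn at y=1.5, on top of each other. With 'first', they land on separate integer positions.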
We now use this rank as the y-values. The order of the data in the Series never changes this way, ensuring that countries remain the same color regardless of their rank.
fig, ax_array = plt.subplots(nrows=1, ncols=3, figsize=(7, 2.5),
                             dpi=144, tight_layout=True)
dates = ['2020-03-29', '2020-03-30', '2020-03-31']
for ax, date in zip(ax_array, dates):
    s = df.loc[date]
    y = df.loc[date].rank(method='first')
    ax.barh(y=y, width=s.values, color=colors, tick_label=s.index)
    ax.set_title(date, fontsize='smaller')
    nice_axes(ax)
Using each day as a single frame in an animation won't work well as it doesn't capture the transition from one time period to the next. In order to transition the bars that change positions, we'll need to add extra rows of data between the dates that we do have. Let's first select the three dates above as a DataFrame.
df2 = df.loc['2020-03-29':'2020-03-31']
df2
It's easier to insert an exact number of new rows when using the default index - integers beginning at 0. Alternatively, if you do have a datetime in the index as we do here, you can use the asfreq method, which is explained at the end of this post. Use the reset_index method to get a default index and to place the dates as a column again.
df2 = df2.reset_index()
df2
We want to insert new rows between the first and second rows and between the second and third rows. Begin by multiplying the index by the number of steps to transition from one time period to the next. We use 5 in this example.
df2.index = df2.index * 5
df2
reindex
To insert the additional rows, pass the reindex method a sequence of all integers from 0 through the last index value (10 in this case). pandas inserts a new row of missing values for every index label not in the current DataFrame.
last_idx = df2.index[-1] + 1
df_expanded = df2.reindex(range(last_idx))
df_expanded
The date is missing in each of the new rows. Let's forward-fill it from the last known value with the ffill method and set it as the index again.
df_expanded['date'] = df_expanded['date'].ffill()
df_expanded = df_expanded.set_index('date')
df_expanded
We also need a similar DataFrame that contains the rank of each country by row. Most pandas methods work down each column by default. Set axis to 1 to change the direction of the operation so that values in each row are ranked against each other.
df_rank_expanded = df_expanded.rank(axis=1, method='first')
df_rank_expanded
The interpolate method can fill in the missing values in a variety of ways. By default, it uses linear interpolation and works column-wise.
df_expanded = df_expanded.interpolate()
df_expanded
We also need to interpolate the ranking.
df_rank_expanded = df_rank_expanded.interpolate()
df_rank_expanded
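To see why the ranks are interpolated rather than recomputed from the interpolated values, consider the minimal sketch below. It uses two made-up categories on two consecutive days (all names and numbers are hypothetical) and compares both approaches.

```python
import pandas as pd

# Hypothetical totals: B overtakes A on the second day.
toy = pd.DataFrame({'A': [50, 60], 'B': [40, 80]})

steps = 5
toy.index = toy.index * steps                 # index becomes 0, 5
toy = toy.reindex(range(toy.index[-1] + 1))   # rows 1-4 are all missing

# Rank the known rows first, then interpolate the ranks.
smooth_ranks = toy.rank(axis=1, method='first').interpolate()
values = toy.interpolate()

# For contrast: ranking the already-interpolated values gives integer ranks.
jumpy_ranks = values.rank(axis=1, method='first')

print(smooth_ranks['A'].tolist())  # glides from 2 down to 1 in equal steps
print(jumpy_ranks['A'].tolist())   # switches from 2 to 1 in a single frame
```

The interpolated ranks move a bar a fraction of a position per frame, which is what produces the sliding motion. Ranking the interpolated values would make the bars swap places abruptly.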
The interpolated ranks will serve as the new position of the bars along the y-axis. Here, we'll plot each step from the first to the second day, where Iran and the USA change places.
fig, ax_array = plt.subplots(nrows=1, ncols=6, figsize=(12, 2),
                             dpi=144, tight_layout=True)
labels = df_expanded.columns
for i, ax in enumerate(ax_array.flatten()):
    y = df_rank_expanded.iloc[i]
    width = df_expanded.iloc[i]
    ax.barh(y=y, width=width, color=colors, tick_label=labels)
    nice_axes(ax)
ax_array[0].set_title('2020-03-29')
ax_array[-1].set_title('2020-03-30');
The next day's transition is plotted below.
fig, ax_array = plt.subplots(nrows=1, ncols=6, figsize=(12, 2),
                             dpi=144, tight_layout=True)
labels = df_expanded.columns
for i, ax in enumerate(ax_array.flatten(), start=5):
    y = df_rank_expanded.iloc[i]
    width = df_expanded.iloc[i]
    ax.barh(y=y, width=width, color=colors, tick_label=labels)
    nice_axes(ax)
ax_array[0].set_title('2020-03-30')
ax_array[-1].set_title('2020-03-31');
We can copy and paste the code above into a function to automate the process of preparing any data for the bar chart race. Then use it to create two final DataFrames of all the data needed for plotting.
def prepare_data(df, steps=5):
    df = df.reset_index()
    df.index = df.index * steps
    last_idx = df.index[-1] + 1
    df_expanded = df.reindex(range(last_idx))
    # Series.ffill replaces the deprecated fillna(method='ffill')
    df_expanded['date'] = df_expanded['date'].ffill()
    df_expanded = df_expanded.set_index('date')
    df_rank_expanded = df_expanded.rank(axis=1, method='first')
    df_expanded = df_expanded.interpolate()
    df_rank_expanded = df_rank_expanded.interpolate()
    return df_expanded, df_rank_expanded
df_expanded, df_rank_expanded = prepare_data(df)
df_expanded.head()
df_rank_expanded.head()
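The function above assumes the time column is named 'date'. As a rough sketch of how the same steps could work for any index, the variant below repeats the original row labels instead of forward-filling a named column. The name prepare_data_any_index and the toy data are hypothetical and only meant as an illustration.

```python
import pandas as pd

def prepare_data_any_index(df, steps=5):
    """Variant sketch: same expansion as above, without assuming a 'date' column."""
    labels = df.index                          # original row labels, any dtype
    values = df.reset_index(drop=True)
    values.index = values.index * steps
    values = values.reindex(range(values.index[-1] + 1))
    ranks = values.rank(axis=1, method='first').interpolate()
    values = values.interpolate()
    # Each original label covers the `steps` frames that start at its row.
    frame_labels = labels.repeat(steps)[:len(values)]
    values.index = frame_labels
    ranks.index = frame_labels
    return values, ranks

# Made-up data with a plain string index.
toy = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, 1]}, index=['x', 'y', 'z'])
toy_values, toy_ranks = prepare_data_any_index(toy, steps=5)
print(toy_values.head(6))
```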
We are now ready to create the animation. Each row represents a single frame in our animation, and consecutive rows slowly transition each bar's y-position and width from one day to the next.
The simplest way to do animation in matplotlib is to use FuncAnimation. You must define a function that updates the matplotlib axes object each frame. Because the axes object keeps all of the previous bars, we remove them at the beginning of the update function. The rest of the function is identical to the plotting from above. This function will be passed the index of the frame as an integer. We also set the title to show the current date.
Optionally, you can define a function that initializes the axes. Below, init clears the axes of all objects and then resets its nice properties.
Pass the figure (containing your axes), the update and init functions, and the number of frames to FuncAnimation. We also pass the number of milliseconds between each frame. We use 100 milliseconds per frame, which equates to 500 milliseconds (half a second) per day.
The figure and axes are created separately below so they do not get output in a Jupyter Notebook, which happens automatically if you call plt.subplots.
from matplotlib.animation import FuncAnimation

def init():
    ax.clear()
    nice_axes(ax)
    ax.set_ylim(.2, 6.8)

def update(i):
    # Iterate over a copy: removing a container also drops it from ax.containers
    for bar in ax.containers[:]:
        bar.remove()
    y = df_rank_expanded.iloc[i]
    width = df_expanded.iloc[i]
    ax.barh(y=y, width=width, color=colors, tick_label=labels)
    # '%-d' (day without zero padding) is platform-specific; on Windows use '%#d'
    date_str = df_expanded.index[i].strftime('%B %-d, %Y')
    ax.set_title(f'COVID-19 Deaths by Country - {date_str}', fontsize='smaller')

fig = plt.Figure(figsize=(4, 2.5), dpi=144)
ax = fig.add_subplot()
anim = FuncAnimation(fig=fig, func=update, init_func=init, frames=len(df_expanded),
                     interval=100, repeat=False)
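As a rough check on timing, the number of frames and the playing time follow directly from the number of time periods, the steps per period, and the interval. The helpers below, frame_count and animation_seconds, are hypothetical names introduced for this sketch and are not part of matplotlib.

```python
def frame_count(n_periods, steps=5):
    """Rows produced by expanding n_periods rows with `steps` frames per transition."""
    return (n_periods - 1) * steps + 1

def animation_seconds(n_periods, steps=5, interval_ms=100):
    """Approximate playing time, assuming one interval per frame."""
    return frame_count(n_periods, steps) * interval_ms / 1000

print(frame_count(3))        # 11 frames for the three-day example
print(animation_seconds(3))  # 1.1 seconds at 100 ms per frame
```

To slow the race down, you can raise either steps (smoother transitions, more frames) or the interval (same frames, shown for longer).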
Call the to_html5_video method to return the animation as an HTML string and then embed it in the notebook with help from the IPython.display module.
from IPython.display import HTML
html = anim.to_html5_video()
HTML(html)
You can save the animation to disk as an mp4 file using the save method. Since we have an init function, we don't have to worry about clearing our axes and resetting the limits. It will do it for us.
anim.save('media/covid19.mp4')
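By default, both to_html5_video and saving to mp4 rely on the external ffmpeg program being installed. As a sketch of one way to handle a machine without it, the code below checks matplotlib's writer registry and falls back to an animated GIF written with Pillow. The tiny stand-in animation, the pick_writer helper, and the output file name are all hypothetical and not part of the code above.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, writers

def pick_writer():
    """Return ('ffmpeg', '.mp4') if ffmpeg is usable, else ('pillow', '.gif')."""
    if writers.is_available('ffmpeg'):
        return 'ffmpeg', '.mp4'
    return 'pillow', '.gif'

# A stand-in animation: one bar that grows over three frames.
fig, ax = plt.subplots()
bars = ax.barh(y=[1], width=[0])
ax.set_xlim(0, 3)

def update(i):
    bars[0].set_width(i + 1)

anim = FuncAnimation(fig, update, frames=3, interval=100, repeat=False)

writer_name, extension = pick_writer()
print(writer_name, extension)
# anim.save('demo' + extension, writer=writer_name)  # uncomment to write the file
```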
bar_chart_race
I created the bar_chart_race Python package to automate this process. It creates bar chart races from wide pandas DataFrames. Install it with pip install bar_chart_race. Read all of the documentation here.
import bar_chart_race as bcr
html = bcr.bar_chart_race(df, figsize=(4, 2.5), title='COVID-19 Deaths by Country')
HTML(html)
asfreq
If you are familiar with pandas, you might know that the asfreq method can be used to insert new rows whenever you have a datetime index. Let's select the last three days of March again to show how it works.
df2 = df.loc['2020-03-29':'2020-03-31']
df2
Inserting new rows is actually easier with asfreq. We just need to supply it a date offset that divides 24 hours evenly. Here, we insert a new row every 6 hours.
df2.asfreq('6h')
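Here is a small self-contained sketch of what asfreq does to a daily index, using made-up numbers rather than the real dataset. The column name and values are hypothetical.

```python
import pandas as pd

# Three hypothetical daily observations.
idx = pd.date_range('2020-03-29', periods=3, freq='D')
toy = pd.DataFrame({'A': [10.0, 20.0, 30.0]}, index=idx)

six_hourly = toy.asfreq('6h')       # new rows are filled with missing values
filled = six_hourly.interpolate()   # linear interpolation between known rows

print(len(six_hourly))              # 9 rows: 4 per day for 2 days, plus the last day
print(filled['A'].tolist())
```

A 6-hour offset gives 4 steps per day. Any offset that does not divide a day evenly would skip over some of the original timestamps, and those rows would be lost.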
Inserting a specific number of rows is a little trickier. It is possible by first creating a date range, which allows you to specify the total number of periods. You must calculate that number yourself (we still use 5 steps per day in this example).
num_periods = (len(df2) - 1) * 5 + 1
dr = pd.date_range(start='2020-03-29', end='2020-03-31', periods=num_periods)
dr
Pass this date range to reindex to achieve the same result.
df2.reindex(dr)
We can use this procedure on all of our data.
num_periods = (len(df) - 1) * 5 + 1
dr = pd.date_range(start=df.index[0], end=df.index[-1], periods=num_periods)
df_expanded = df.reindex(dr)
df_rank_expanded = df_expanded.rank(axis=1, method='first').interpolate()
df_expanded = df_expanded.interpolate()
df_expanded.iloc[160:166]
df_rank_expanded.iloc[160:166]
It's possible to do all of the analysis in a single ugly line of code.
df_one = df.reset_index() \
           .reindex([i / 5 for i in range(len(df) * 5 - 4)]) \
           .reset_index(drop=True) \
           .pipe(lambda x: pd.concat([x, x.iloc[:, 1:].rank(axis=1, method='first')],
                                     axis=1, keys=['values', 'ranks'])) \
           .interpolate() \
           .ffill() \
           .set_index(('values', 'date')) \
           .rename_axis(index='date')
df_one.head()
If you are looking for a single, comprehensive resource to master pandas, matplotlib, and seaborn, check out my book Master Data Analysis with Python. It contains 800 pages and 500 exercises with detailed solutions. If you want to be a trusted source to do data analysis using Python, this book will ensure you get there.