An Overview of Pandas GroupBy

In [1]:
import numpy as np
import pandas as pd

This exercise is inspired by Wes McKinney's Python for Data Analysis

In [2]:
df = pd.read_csv("elections.csv")
df.head()
Out[2]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
2 Anderson Independent 6.6 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss

Grouping a Series by a Series

Let's group the % Series by the Party Series. A call to groupby does that, but what is the object that results?

In [3]:
percent_grouped_by_party = df['%'].groupby(df['Party'])
type(percent_grouped_by_party)
Out[3]:
pandas.core.groupby.groupby.SeriesGroupBy

As we see, percent_grouped_by_party is NOT a familiar object like a DataFrame, Series, or Index. Instead, it's a SeriesGroupBy object. A SeriesGroupBy consists of groups, one for each of the distinct values of the Party column. If we ask to see these groups, we'll be able to see which indices in the original DataFrame correspond to each group.

In [4]:
percent_grouped_by_party.groups
Out[4]:
{'Democratic': Int64Index([1, 4, 6, 7, 10, 13, 15, 17, 19, 21], dtype='int64'),
 'Independent': Int64Index([2, 9, 12], dtype='int64'),
 'Republican': Int64Index([0, 3, 5, 8, 11, 14, 16, 18, 20, 22], dtype='int64')}

The percent_grouped_by_party object is capable of making computations across all these groups. For example, if we call the mean method of the SeriesGroupBy class, we'll get a new Series containing the mean of the "Democratic" Series, the mean of the "Independent" Series, and the mean of the "Republican" Series.

In [5]:
percent_grouped_by_party.mean()
Out[5]:
Party
Democratic     46.53
Independent    11.30
Republican     47.86
Name: %, dtype: float64

The output of the mean methood is a regular ole pandas Series.

In [6]:
type(percent_grouped_by_party.mean())
Out[6]:
pandas.core.series.Series

SeriesGroupBy objects have many other handy methods, e.g. max and min.

In [7]:
percent_grouped_by_party.max()
Out[7]:
Party
Democratic     52.9
Independent    18.9
Republican     58.8
Name: %, dtype: float64
In [8]:
percent_grouped_by_party.min()
Out[8]:
Party
Democratic     37.6
Independent     6.6
Republican     37.4
Name: %, dtype: float64
In [9]:
percent_grouped_by_party.size()
Out[9]:
Party
Democratic     10
Independent     3
Republican     10
Name: %, dtype: int64
In [10]:
percent_grouped_by_party.first()
Out[10]:
Party
Democratic     41.0
Independent     6.6
Republican     50.7
Name: %, dtype: float64

We can iterate over a SeriesGroupBy object -- though we're doing this just for educational purposes and you'll probably never actually do this with a real SeriesGroupBy. As we iterate we get pairs of (name, group), where name is a String label for the group, and group is a Series corresponding to the all the values from the given group.

In [11]:
from IPython.display import display  # like print, but for complex objects

for name, group in percent_grouped_by_party:
    print('Name:', name)
    print(type(group))
    display(group)
Name: Democratic
<class 'pandas.core.series.Series'>
1     41.0
4     37.6
6     45.6
7     43.0
10    49.2
13    48.4
15    48.3
17    52.9
19    51.1
21    48.2
Name: %, dtype: float64
Name: Independent
<class 'pandas.core.series.Series'>
2      6.6
9     18.9
12     8.4
Name: %, dtype: float64
Name: Republican
<class 'pandas.core.series.Series'>
0     50.7
3     58.8
5     53.4
8     37.4
11    40.7
14    47.9
16    50.7
18    45.7
20    47.2
22    46.1
Name: %, dtype: float64

Grouping a Series by Multiple Series

We can also group a Series by multiple Series. For example, suppose we want to track all combinations of {'Democratic', 'Republican', and 'Independent'} and {'win', 'loss'}.

In [12]:
percent_grouped_by_party_and_result = df['%'].groupby([df['Party'], df['Result']])
percent_grouped_by_party_and_result.groups
Out[12]:
{('Democratic', 'loss'): Int64Index([1, 4, 6, 13, 15, 21], dtype='int64'),
 ('Democratic', 'win'): Int64Index([7, 10, 17, 19], dtype='int64'),
 ('Independent', 'loss'): Int64Index([2, 9, 12], dtype='int64'),
 ('Republican', 'loss'): Int64Index([8, 11, 18, 20], dtype='int64'),
 ('Republican', 'win'): Int64Index([0, 3, 5, 14, 16, 22], dtype='int64')}

Given this groupby object, we can compute the average percentage earned every time each of the parties won and lost the presidential election. We see that at least between 1980 and 2016, the Republicans have typically lost and won their elections by wider margins.

In [13]:
percent_grouped_by_party_and_result.mean()
Out[13]:
Party        Result
Democratic   loss      44.850000
             win       49.050000
Independent  loss      11.300000
Republican   loss      42.750000
             win       51.266667
Name: %, dtype: float64

The careful reader will note that the returned object looks a little funny. It seems to have two indexes! If we check the type of this object, we'll see it's just a regular Series.

In [14]:
type(percent_grouped_by_party_and_result.mean())
Out[14]:
pandas.core.series.Series

However if we request to see the index of this Series, we see that it is a "MultiIndex", which is a special type of index used for data that is indexed in two or more ways.

In [15]:
percent_grouped_by_party_and_result.mean().index
Out[15]:
MultiIndex(levels=[['Democratic', 'Independent', 'Republican'], ['loss', 'win']],
           labels=[[0, 0, 1, 2, 2], [0, 1, 0, 0, 1]],
           names=['Party', 'Result'])

We won't go into MultiIndexes formally today, but be aware that it is possible to index into a Series that has a MultiIndex. The sytnax is about exactly as you'd expect, for example:

In [16]:
percent_grouped_by_party_and_result.mean()["Democratic"]
Out[16]:
Result
loss    44.85
win     49.05
Name: %, dtype: float64

Grouping a DataFrame by a Series

We can also group an entire dataframe by one or more Series. This results in a DataFrameGroupBy object as the result:

In [17]:
everything_grouped_by_party = df.groupby('Party')
everything_grouped_by_party
Out[17]:
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x1146600f0>

As in our previous example, this object contains three group objects, one for each party label.

In [18]:
everything_grouped_by_party.groups
Out[18]:
{'Democratic': Int64Index([1, 4, 6, 7, 10, 13, 15, 17, 19, 21], dtype='int64'),
 'Independent': Int64Index([2, 9, 12], dtype='int64'),
 'Republican': Int64Index([0, 3, 5, 8, 11, 14, 16, 18, 20, 22], dtype='int64')}

Just as with SeriesGroupBy objects, we can iterate over a DataFrameGroupBy object to understand what is effectively inside.

In [19]:
for n, g in everything_grouped_by_party:
    print('name:', n)
    display(g)
name: Democratic
Candidate Party % Year Result
1 Carter Democratic 41.0 1980 loss
4 Mondale Democratic 37.6 1984 loss
6 Dukakis Democratic 45.6 1988 loss
7 Clinton Democratic 43.0 1992 win
10 Clinton Democratic 49.2 1996 win
13 Gore Democratic 48.4 2000 loss
15 Kerry Democratic 48.3 2004 loss
17 Obama Democratic 52.9 2008 win
19 Obama Democratic 51.1 2012 win
21 Clinton Democratic 48.2 2016 loss
name: Independent
Candidate Party % Year Result
2 Anderson Independent 6.6 1980 loss
9 Perot Independent 18.9 1992 loss
12 Perot Independent 8.4 1996 loss
name: Republican
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
3 Reagan Republican 58.8 1984 win
5 Bush Republican 53.4 1988 win
8 Bush Republican 37.4 1992 loss
11 Dole Republican 40.7 1996 loss
14 Bush Republican 47.9 2000 win
16 Bush Republican 50.7 2004 win
18 McCain Republican 45.7 2008 loss
20 Romney Republican 47.2 2012 loss
22 Trump Republican 46.1 2016 win

And just like SeriesGroupBy objects, we can apply methods like mean to compute the mean of each group. Since a DataFrameGroupBy is linked to the entire original dataframe (instead of to a single column from the dataframe), we calculate a mean for every numerical column. In this example below, we get the mean vote earned (as before), and the mean year (which isn't a useful quantity).

In [20]:
everything_grouped_by_party.mean()
Out[20]:
% Year
Party
Democratic 46.53 1998.000000
Independent 11.30 1989.333333
Republican 47.86 1998.000000

Where did all the other columns go in the mean above? They are nuisance columns, which get automatically eliminated from an operation where it doesn't make sense (such as a numerical mean).

Grouping a DataFrame by Multiple Series

DataFrames may also be grouped by multiple series at once. For example, we can repeat what we did with a Series above and group the entire DataFrame by Party and Result. After aggregation, we end up with a DataFrame that has a MultiIndex.

In [21]:
everything_grouped_by_party_and_result=df.groupby([df['Party'], df['Result']])
In [22]:
everything_grouped_by_party_and_result.max()
Out[22]:
Candidate % Year
Party Result
Democratic loss Mondale 48.4 2016
win Obama 52.9 2012
Independent loss Perot 18.9 1996
Republican loss Romney 47.2 2012
win Trump 58.8 2016

The resulting DataFrame above is pretty strange. We'll observe that Walter Mondale did not run for office in 2016! Make sure you understand why this is happening, as this sort of thing is a common mistake made by people who don't fully understand how pandas works.

Challenge: Try to figure out how to generate a table like the one above, except that each entry should represent all attributes of the candidate who got the maximum vote in that category. For example, the Republican Win category should feature Reagan in 1984 with 58.8% of the vote. The answer is a few cells below this one. There's a hint a few cells down.

Hint: Consider using sort_values and first somehow.

In [23]:
df.sort_values("%", ascending=False).groupby([df['Party'], df['Result']]).first()
Out[23]:
Candidate Party % Year Result
Party Result
Democratic loss Gore Democratic 48.4 2000 loss
win Obama Democratic 52.9 2008 win
Independent loss Perot Independent 18.9 1992 loss
Republican loss Romney Republican 47.2 2012 loss
win Reagan Republican 58.8 1984 win

The result of an aggregation function applied to a DataFrameGroupBy

Custom Aggregation Functions

As described above, both SeriesGroupBy and DataFrameGroupBy objects have lots of handy methods for computing aggregate values for groups, e.g.

In [24]:
percent_grouped_by_party.min()
Out[24]:
Party
Democratic     37.6
Independent     6.6
Republican     37.4
Name: %, dtype: float64
In [25]:
everything_grouped_by_party.median()
Out[25]:
% Year
Party
Democratic 48.25 1998
Independent 8.40 1992
Republican 47.55 1998

It turns out that all of these GroupBy methods are just shorthand for a more powerful and universal method of our GroupBy objects called agg. For example, .min() is just shorthand for .agg(min), where min refers to the function min.

In [26]:
percent_grouped_by_party.agg(min)
Out[26]:
Party
Democratic     37.6
Independent     6.6
Republican     37.4
Name: %, dtype: float64

Naturally, we can define our own custom aggregation functions. For example, the function below returns the first item in a series.

In [27]:
def average_of_first_and_last(series):
    return (series.iloc[0] + series.iloc[-1])/2

We can supply this function as a custom aggregation function for each series. As you can see, nuisance columns are automatically removed.

In [28]:
everything_grouped_by_party.agg(average_of_first_and_last)
Out[28]:
% Year
Party
Democratic 44.6 1998
Independent 7.5 1988
Republican 48.4 1998

agg is fundamental to our use of GroupBy objects. Indeed, you will rarely call groupby without also calling agg, at least implicitly. We can summarize the grouping process with the following diagram, inspired by a similar diagram created by DS100 instructor Joey Gonzales. Diagram source at this link.

groupby

The result of calling groupby then agg on a Series is also a Series, and the result of calling groupby then agg on a DataFrame is also typically a DataFrame, though there are exceptions, e.g. if you use the aggregation function size, you will get back a Series.

In [29]:
def list_of_first_and_last(series):
    return [series.iloc[0], series.iloc[-1]]
In [30]:
everything_grouped_by_party.agg(list_of_first_and_last)
Out[30]:
Candidate % Year Result
Party
Democratic [Carter, Clinton] [41.0, 48.2] [1980, 2016] [loss, loss]
Independent [Anderson, Perot] [6.6, 8.4] [1980, 1996] [loss, loss]
Republican [Reagan, Trump] [50.7, 46.1] [1980, 2016] [win, win]

Using Groups to Filter Datasets

In [31]:
everything_grouped_by_party = df.groupby('Party')
In [32]:
everything_grouped_by_party.filter(lambda subframe: subframe["Year"].max() < 2000)
Out[32]:
Candidate Party % Year Result
2 Anderson Independent 6.6 1980 loss
9 Perot Independent 18.9 1992 loss
12 Perot Independent 8.4 1996 loss
In [33]:
everything_grouped_by_year = df.groupby('Year')
In [34]:
everything_grouped_by_year.filter(lambda subframe: subframe["%"].sum() < 97)
Out[34]:
Candidate Party % Year Result
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
13 Gore Democratic 48.4 2000 loss
14 Bush Republican 47.9 2000 win
21 Clinton Democratic 48.2 2016 loss
22 Trump Republican 46.1 2016 win

A visual picture of how filtering works is shown below.

filter

Using isin for filtering

This isn't actually related to groupby at all, but it seemed natural to bring it up here.

Sometimes we want to restrict our attention to only rows where certain values appear. For example, we saw last time how we'd look at only rows that contain "Democratic" candidates.

In [35]:
df[df["Party"] == "Democratic"]
Out[35]:
Candidate Party % Year Result
1 Carter Democratic 41.0 1980 loss
4 Mondale Democratic 37.6 1984 loss
6 Dukakis Democratic 45.6 1988 loss
7 Clinton Democratic 43.0 1992 win
10 Clinton Democratic 49.2 1996 win
13 Gore Democratic 48.4 2000 loss
15 Kerry Democratic 48.3 2004 loss
17 Obama Democratic 52.9 2008 win
19 Obama Democratic 51.1 2012 win
21 Clinton Democratic 48.2 2016 loss

Suppose we wanted to filter such that all Republicans and Democrats appeared. One ugly way to do this would be df[(df["Party"] == "Democratic") | (df["Party"] == "Republican")]. However, a better idea is to use the isin method.

In [36]:
df[(df["Party"] == "Democratic") | (df["Party"] == "Republican")]

df[df["Party"].isin(["Republican", "Democratic"])]
Out[36]:
Candidate Party % Year Result
0 Reagan Republican 50.7 1980 win
1 Carter Democratic 41.0 1980 loss
3 Reagan Republican 58.8 1984 win
4 Mondale Democratic 37.6 1984 loss
5 Bush Republican 53.4 1988 win
6 Dukakis Democratic 45.6 1988 loss
7 Clinton Democratic 43.0 1992 win
8 Bush Republican 37.4 1992 loss
10 Clinton Democratic 49.2 1996 win
11 Dole Republican 40.7 1996 loss
13 Gore Democratic 48.4 2000 loss
14 Bush Republican 47.9 2000 win
15 Kerry Democratic 48.3 2004 loss
16 Bush Republican 50.7 2004 win
17 Obama Democratic 52.9 2008 win
18 McCain Republican 45.7 2008 loss
19 Obama Democratic 51.1 2012 win
20 Romney Republican 47.2 2012 loss
21 Clinton Democratic 48.2 2016 loss
22 Trump Republican 46.1 2016 win

Grouping over a different dimension (bonus topic, less often useful)

Above, we've been grouping data along the rows, using column keys as our selectors. But we can also group along the columns, for example we can group by how many times the letter a appears in the column name.

In [37]:
grouped = df.groupby(lambda x: x.count('a'), axis=1)
for dtype, group in grouped:
    print(dtype)
    display(group)
0
% Result
0 50.7 win
1 41.0 loss
2 6.6 loss
3 58.8 win
4 37.6 loss
5 53.4 win
6 45.6 loss
7 43.0 win
8 37.4 loss
9 18.9 loss
10 49.2 win
11 40.7 loss
12 8.4 loss
13 48.4 loss
14 47.9 win
15 48.3 loss
16 50.7 win
17 52.9 win
18 45.7 loss
19 51.1 win
20 47.2 loss
21 48.2 loss
22 46.1 win
1
Party Year
0 Republican 1980
1 Democratic 1980
2 Independent 1980
3 Republican 1984
4 Democratic 1984
5 Republican 1988
6 Democratic 1988
7 Democratic 1992
8 Republican 1992
9 Independent 1992
10 Democratic 1996
11 Republican 1996
12 Independent 1996
13 Democratic 2000
14 Republican 2000
15 Democratic 2004
16 Republican 2004
17 Democratic 2008
18 Republican 2008
19 Democratic 2012
20 Republican 2012
21 Democratic 2016
22 Republican 2016
2
Candidate
0 Reagan
1 Carter
2 Anderson
3 Reagan
4 Mondale
5 Bush
6 Dukakis
7 Clinton
8 Bush
9 Perot
10 Clinton
11 Dole
12 Perot
13 Gore
14 Bush
15 Kerry
16 Bush
17 Obama
18 McCain
19 Obama
20 Romney
21 Clinton
22 Trump
In [38]:
grouped.max().head(5)
Out[38]:
0 1 2
0 50.7 1980 Reagan
1 41.0 1980 Carter
2 6.6 1980 Anderson
3 58.8 1984 Reagan
4 37.6 1984 Mondale
In [39]:
financial_data = pd.read_csv("financial_data.csv", index_col = 0)
In [40]:
financial_data.head(5)
Out[40]:
1/3/1984 2/1/1984 3/1/1984 4/2/1984 5/1/1984 6/1/1984 7/2/1984 8/1/1984 9/4/1984 10/1/1984 ... 3/1/2016 4/1/2016 5/1/2016 6/1/2016 7/1/2016 8/1/2016 9/1/2016 10/1/2016 11/1/2016 12/1/2016
Date
2 Year Treasury Yield 10.64 10.79 11.31 11.69 12.47 12.91 12.88 12.43 12.20 11.60 ... 0.88 0.77 0.82 0.73 0.67 0.74 0.77 0.84 0.98 1.20
30 Year Mortgage Rate 13.37 13.23 13.39 13.65 13.94 14.42 14.67 14.47 14.35 14.13 ... 3.69 3.61 3.60 3.57 3.44 3.44 3.46 3.47 3.77 4.20
Federal Funds Rate 9.56 9.59 9.91 10.29 10.32 11.06 11.23 11.64 11.30 9.99 ... 0.36 0.37 0.37 0.38 0.39 0.40 0.40 0.40 0.41 0.54

3 rows × 396 columns

In [41]:
def get_year(datestr):
    return datestr.split('/')[2]
In [42]:
grouped_by_year = financial_data.groupby(get_year, axis=1)
grouped_by_year.mean()
Out[42]:
1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
Date
2 Year Treasury Yield 11.645833 9.271667 6.870833 7.406667 8.098333 8.575833 8.157500 6.4850 4.770833 4.048333 ... 4.357500 2.004167 0.956667 0.702500 0.451667 0.276667 0.307500 0.462500 0.685833 0.835833
30 Year Mortgage Rate 13.870000 12.424167 10.181667 10.195000 10.334167 10.324167 10.129167 9.2500 8.400000 7.325833 ... 6.342500 6.038333 5.041667 4.691667 4.457500 3.656667 3.983333 4.172500 3.849167 3.648333
Federal Funds Rate 10.225000 8.100833 6.805000 6.657500 7.568333 9.216667 8.099167 5.6875 3.521667 3.022500 ... 5.019167 1.927500 0.160000 0.175000 0.101667 0.140000 0.107500 0.089167 0.132500 0.395000

3 rows × 33 columns

Example from Lecture Figures

For your convenience, we've provided the dataframe from lecture for the groupby examples.

In [43]:
lex = pd.DataFrame({"name": ["A", "B", "C", "A", "B", "C", "A", "D", "B"],
             "num": [3, 1, 4, 1, 5, 9, 2, 5, 6]})
In [44]:
lex.groupby("name").filter(lambda sf: sf["num"].sum() > 10)
Out[44]:
name num
1 B 1
2 C 4
4 B 5
5 C 9
8 B 6

Pivot Tables

Recall from before that we were able to group the % Series by the "Party" and "Result" Series, allowing us to understand the average vote earned by each party under each election result.

In [45]:
percent_grouped_by_party_and_result = df['%'].groupby([df['Party'], df['Result']])
percent_grouped_by_party_and_result.mean()
Out[45]:
Party        Result
Democratic   loss      44.850000
             win       49.050000
Independent  loss      11.300000
Republican   loss      42.750000
             win       51.266667
Name: %, dtype: float64

Because we called groupby on a Series, the result of our aggregation operation was also a Series. However, I believe this data is more naturally expressed in a tabular format, with Party as the rows, and Result as the columns. The pivot_table operation is the natural way to achieve this data format.

In [46]:
df_pivot = df.pivot_table(
    index='Party', # the rows (turned into index)
    columns='Result', # the column values
    values='%', # the field(s) to processed in each group
    aggfunc=np.mean, # group operation
)
df_pivot.head()
Out[46]:
Result loss win
Party
Democratic 44.85 49.050000
Independent 11.30 NaN
Republican 42.75 51.266667

The basic idea is that you specify a Series to be the index (i.e. rows) and a Series to be the columns. The data in the specified values is then grouped by all possible combinations of values that occur in the index and columns Series. These groups are then aggregated using the aggfunc, and arranged into a table that matches the requested index and columns. The diagram below summarizes how pivot tables are formed. (Diagram inspired by Joey Gonzales). Diagram source at this link.

groupby

For more on pivot tables, see this excellent tutorial by Chris Moffitt.

List Arguments to pivot_table (Extra)

The arguments to our pivot_table method can also be lists. A few examples are given below.

If we pivot such that only our columns argument is a list, we end up with columns that are MultiIndexed.

In [47]:
df_pivot = df.pivot_table(
    index='Party', # the rows (turned into index)
    columns=['Result', 'Candidate'], # the column values
    values='%', # the field(s) to processed in each group
    aggfunc=np.mean, # group operation
)
df_pivot.head()
Out[47]:
Result loss win
Candidate Anderson Bush Carter Clinton Dole Dukakis Gore Kerry McCain Mondale Perot Romney Bush Clinton Obama Reagan Trump
Party
Democratic NaN NaN 41.0 48.2 NaN 45.6 48.4 48.3 NaN 37.6 NaN NaN NaN 46.1 52.0 NaN NaN
Independent 6.6 NaN NaN NaN NaN NaN NaN NaN NaN NaN 13.65 NaN NaN NaN NaN NaN NaN
Republican NaN 37.4 NaN NaN 40.7 NaN NaN NaN 45.7 NaN NaN 47.2 50.666667 NaN NaN 54.75 46.1

If we pivot such that only our index argument is a list, we end up with rows that are MultiIndexed.

In [48]:
df_pivot = df.pivot_table(
    index=['Party', 'Candidate'], # the rows (turned into index)
    columns='Result',# the column values
    values='%', # the field(s) to processed in each group
    aggfunc=np.mean, # group operation
)
df_pivot
Out[48]:
Result loss win
Party Candidate
Democratic Carter 41.00 NaN
Clinton 48.20 46.100000
Dukakis 45.60 NaN
Gore 48.40 NaN
Kerry 48.30 NaN
Mondale 37.60 NaN
Obama NaN 52.000000
Independent Anderson 6.60 NaN
Perot 13.65 NaN
Republican Bush 37.40 50.666667
Dole 40.70 NaN
McCain 45.70 NaN
Reagan NaN 54.750000
Romney 47.20 NaN
Trump NaN 46.100000

If we pivot such that only our values argument is a list, then we again get a DataFrame with multi-indexed Columns.

In [49]:
df_pivot = df.pivot_table(
    index='Party', # the rows (turned into index)
    columns='Result',# the column values
    values=['%', 'Year'], # the field(s) to processed in each group
    aggfunc=np.mean, # group operation
)
df_pivot
Out[49]:
% Year
Result loss win loss win
Party
Democratic 44.85 49.050000 1995.333333 2002.000000
Independent 11.30 NaN 1989.333333 NaN
Republican 42.75 51.266667 2002.000000 1995.333333

Feel free to experiment with other possibilities!