A quick look at Pandas GroupBy

In [1]:
import numpy as np
import pandas as pd

Let's make a toy DataFrame (example taken from Wes McKinney's Python for Data Analysis):

In [2]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
Out[2]:
      data1     data2 key1 key2
0 -0.238995 -0.579480    a  one
1  0.440934  0.000078    a  two
2  0.687318 -1.390271    b  one
3 -0.933768  0.187059    b  two
4  0.012484 -1.788194    a  one

Let's group the data1 column by the key1 column. A call to groupby does that, but what is the object that results?

In [3]:
grouped = df['data1'].groupby(df['key1'])
grouped
Out[3]:
<pandas.core.groupby.SeriesGroupBy object at 0x113acf9e8>

As we see, it's not simply a new Series or DataFrame. Instead, it's an object that records which rows belong to each group:

In [4]:
grouped.groups
Out[4]:
{'a': Int64Index([0, 1, 4], dtype='int64'),
 'b': Int64Index([2, 3], dtype='int64')}
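
To make that mapping concrete, here is a minimal sketch. It assumes fixed values (1.0 to 5.0) in place of the random data above, purely so the result is reproducible. It shows that each entry of groups holds row labels, while get_group returns the values behind those labels.

```python
import pandas as pd

# Assumed fixed values (not the random data above) so results are reproducible.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})
grouped = df['data1'].groupby(df['key1'])

# .groups maps each group name to the index labels of its rows.
labels_a = grouped.groups['a'].tolist()
print(labels_a)

# get_group pulls out the values that belong to a single group.
values_a = grouped.get_group('a')
print(values_a.tolist())
```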

The grouped object can run a computation within each group and collect the results, one per group:

In [5]:
grouped.mean()
Out[5]:
key1
a    0.071474
b   -0.123225
Name: data1, dtype: float64
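
As a sanity check on what mean() is doing, the sketch below compares it with a by-hand computation for one group. The fixed values are an assumption made for reproducibility, and the use of agg with a list of function names is just one way to ask for several reductions at once.

```python
import pandas as pd

# Assumed fixed values (not the random data above) so results are reproducible.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})
grouped = df['data1'].groupby(df['key1'])

# One mean per group, indexed by group name.
group_means = grouped.mean()

# The same number for group 'a', computed by hand with a boolean mask.
manual_a = df.loc[df['key1'] == 'a', 'data1'].mean()
print(group_means['a'], manual_a)

# agg applies several reductions at once, one output column per function.
summary = grouped.agg(['mean', 'count'])
print(summary)
```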

But it can be informative to look at what's inside. We can iterate over a groupby object; each iteration yields a (name, group) pair, where the group is either a Series or a DataFrame, depending on whether the groupby object is a SeriesGroupBy (as above) or a DataFrameGroupBy (see below):

In [13]:
from IPython.display import display  # like print, but for complex objects

for name, group in grouped:
    print('Name:', name)
    display(group)
Name: a
0   -0.238995
1    0.440934
4    0.012484
Name: data1, dtype: float64
Name: b
2    0.687318
3   -0.933768
Name: data1, dtype: float64
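
Because iteration yields (name, group) pairs, the pieces can also be collected into a plain dict for later lookup. This is a small sketch with assumed fixed values rather than the random data above.

```python
import pandas as pd

# Assumed fixed values (not the random data above) so results are reproducible.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})
grouped = df['data1'].groupby(df['key1'])

# Each iteration gives a (name, group) pair, so dict() can consume them directly.
pieces = dict(list(grouped))

# Every value is an ordinary Series that keeps its original row labels.
b_values = pieces['b']
print(b_values.index.tolist(), b_values.tolist())
```
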
In [14]:
g2 = df['data1'].groupby([df['key1'], df['key2']])
g2.groups
Out[14]:
{('a', 'one'): Int64Index([0, 4], dtype='int64'),
 ('a', 'two'): Int64Index([1], dtype='int64'),
 ('b', 'one'): Int64Index([2], dtype='int64'),
 ('b', 'two'): Int64Index([3], dtype='int64')}
In [15]:
df
Out[15]:
      data1     data2 key1 key2
0 -0.238995 -0.579480    a  one
1  0.440934  0.000078    a  two
2  0.687318 -1.390271    b  one
3 -0.933768  0.187059    b  two
4  0.012484 -1.788194    a  one
In [16]:
g2.mean()
Out[16]:
key1  key2
a     one    -0.113255
      two     0.440934
b     one     0.687318
      two    -0.933768
Name: data1, dtype: float64
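
Grouping on two keys gives a result with a two-level index. The sketch below, which assumes fixed values instead of the random data, shows an equivalent spelling that passes column names to groupby, and uses unstack to turn the inner level (key2) into columns.

```python
import pandas as pd

# Assumed fixed values (not the random data above) so results are reproducible.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Group a single column by two key Series, as in the cell above.
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

# Equivalent spelling: group the frame by column names, then select data1.
means_by_name = df.groupby(['key1', 'key2'])['data1'].mean()

# unstack moves the inner index level (key2) into the columns.
table = means.unstack()
print(table)
```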

Let's group the entire DataFrame on a single key. This produces a DataFrameGroupBy object:

In [17]:
k1g = df.groupby('key1')
k1g
Out[17]:
<pandas.core.groupby.DataFrameGroupBy object at 0x113b66f28>
In [18]:
k1g.groups
Out[18]:
{'a': Int64Index([0, 1, 4], dtype='int64'),
 'b': Int64Index([2, 3], dtype='int64')}
In [19]:
k1g.mean()
Out[19]:
         data1     data2
key1
a     0.071474 -0.789199
b    -0.123225 -0.601606
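
The two kinds of groupby object are closely related: selecting one column from a DataFrameGroupBy gives back a SeriesGroupBy. A minimal sketch, again with assumed fixed values in place of the random data:

```python
import pandas as pd

# Assumed fixed values (not the random data above) so results are reproducible.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})
k1g = df.groupby('key1')

# Indexing the grouped frame with a column name narrows it to that column.
data1_groups = k1g['data1']
kind = type(data1_groups).__name__
print(kind)

# The result matches grouping the data1 Series by the key1 Series directly.
means = data1_groups.mean()
direct = df['data1'].groupby(df['key1']).mean()
print(means.tolist() == direct.tolist())
```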

But let's look at what's inside of k1g:

In [20]:
for n, g in k1g:
    print('name:', n)
    display(g)
name: a
      data1     data2 key1 key2
0 -0.238995 -0.579480    a  one
1  0.440934  0.000078    a  two
4  0.012484 -1.788194    a  one
name: b
      data1     data2 key1 key2
2  0.687318 -1.390271    b  one
3 -0.933768  0.187059    b  two

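Where did column key2 go in the mean above? It's a nuisance column: a non-numeric column that the pandas version used here drops automatically from an operation where it doesn't make sense (such as a numerical mean). Note that this silent dropping is version-dependent. From pandas 2.0 on, the same call raises a TypeError unless numeric_only=True is passed or the numeric columns are selected explicitly.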
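
Here is a small sketch of two explicit ways to keep a non-numeric column such as key2 out of a mean, so the result does not depend on automatic dropping. The fixed values are an assumption made for reproducibility.

```python
import pandas as pd

# Assumed fixed values (not the random data above) so results are reproducible.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Option 1: ask the reduction to skip non-numeric columns.
numeric_means = df.groupby('key1').mean(numeric_only=True)

# Option 2: name the columns to aggregate before reducing.
selected_means = df.groupby('key1')[['data1', 'data2']].mean()

print(numeric_means)
```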

Grouping over a different dimension

Above, we've been grouping data along the rows, using column keys as our selectors. But we can also group along the columns. For example, we can group the columns by data type:

In [21]:
df.dtypes
Out[21]:
data1    float64
data2    float64
key1      object
key2      object
dtype: object
In [22]:
grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
    print(dtype)
    display(group)
float64
      data1     data2
0 -0.238995 -0.579480
1  0.440934  0.000078
2  0.687318 -1.390271
3 -0.933768  0.187059
4  0.012484 -1.788194
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
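
Recent pandas releases deprecate the axis=1 argument to groupby, so the cell above may warn or fail depending on the installed version. As a hedged alternative, the sketch below splits the columns by dtype with plain Python instead of groupby. The helper names columns_by_dtype and pieces are hypothetical, the values are fixed for reproducibility, and the label printed for the string columns depends on the pandas version.

```python
import pandas as pd

# Assumed fixed values (not the random data above) so results are reproducible.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Collect column names under the string form of their dtype.
columns_by_dtype = {}
for column, dtype in df.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

# Slice the frame once per dtype to get one sub-frame per group of columns.
pieces = {name: df[cols] for name, cols in columns_by_dtype.items()}
for name, piece in pieces.items():
    print(name, list(piece.columns))
```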