A quick look at Pandas GroupBy

import numpy as np
import pandas as pd

Let's make a toy DataFrame (example taken from Wes McKinney's Python for Data Analysis):

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

Let's group the data1 column by the key1 column. A call to groupby does that, but what is the object that results?

grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x113acf9e8>

As we see, it's not simply a new DataFrame. Instead, it's an object that consists of groups:

grouped.groups

{'a': Int64Index([0, 1, 4], dtype='int64'),
 'b': Int64Index([2, 3], dtype='int64')}
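The groups mapping above pairs each key with the index labels of its rows. As a minimal sketch, those labels can be fed back into the original frame to pull out one group by hand. The frame here hard-codes the data1 values shown in this notebook instead of drawing random numbers, which is an assumption made only for reproducibility; labels_a, rows_a and same_rows are illustrative names.

```python
import pandas as pd

# Fixed copy of the toy data (values hard-coded for reproducibility).
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [-0.238995, 0.440934, 0.687318, -0.933768, 0.012484]})
grouped = df['data1'].groupby(df['key1'])

# Each value in .groups holds the index labels of the rows in that group,
# so it can be passed straight to .loc on the original object.
labels_a = grouped.groups['a']
rows_a = df.loc[labels_a, 'data1']

# get_group returns the same rows without the manual lookup.
same_rows = grouped.get_group('a')
print(rows_a)
print(same_rows)
```

Both printed Series should contain the rows labelled 0, 1 and 4.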

The grouped object can compute an aggregate separately for each group:

grouped.mean()

key1
a    0.071474
b   -0.123225
Name: data1, dtype: float64
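mean is only one of the available aggregations. The sketch below, which again assumes a hard-coded copy of the data1 values shown in this notebook, uses agg to compute several per-group statistics in one call; summary is an illustrative name.

```python
import pandas as pd

# Fixed copy of the toy data (values hard-coded for reproducibility).
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [-0.238995, 0.440934, 0.687318, -0.933768, 0.012484]})
grouped = df['data1'].groupby(df['key1'])

# agg with a list of function names returns one column per statistic,
# indexed by the group key.
summary = grouped.agg(['mean', 'count', 'max'])
print(summary)
```

The mean column should agree with the output of grouped.mean() above.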

But it can be informative to look at what's inside. Iterating over a groupby object yields (name, group) pairs, where each group is either a Series or a DataFrame, depending on whether the groupby object is a SeriesGroupBy (as above) or a DataFrameGroupBy (see below):

from IPython.display import display  # like print, but for complex objects

for name, group in grouped:
    print('Name:', name)
    display(group)

Name: a

0   -0.238995
1    0.440934
4    0.012484
Name: data1, dtype: float64

Name: b

2    0.687318
3   -0.933768
Name: data1, dtype: float64
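Because iteration yields (name, group) pairs, the pieces can also be collected into an ordinary dict for later lookup. This is a small sketch on a hard-coded copy of the toy data (an assumption for reproducibility); pieces is an illustrative name.

```python
import pandas as pd

# Fixed copy of the toy data (values hard-coded for reproducibility).
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [-0.238995, 0.440934, 0.687318, -0.933768, 0.012484]})
grouped = df['data1'].groupby(df['key1'])

# Build a dict that maps each group name to its Series.
pieces = {name: group for name, group in grouped}

print(sorted(pieces))   # the group names
print(pieces['b'])      # the Series for group 'b'
```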

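Grouping by more than one key works the same way; each group is then named by a tuple of key values:

g2 = df['data1'].groupby([df['key1'], df['key2']])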
g2.groups

{('a', 'one'): Int64Index([0, 4], dtype='int64'),
 ('a', 'two'): Int64Index([1], dtype='int64'),
 ('b', 'one'): Int64Index([2], dtype='int64'),
 ('b', 'two'): Int64Index([3], dtype='int64')}

df

g2.mean()

key1  key2
a     one    -0.113255
      two     0.440934
b     one     0.687318
      two    -0.933768
Name: data1, dtype: float64
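The two-key mean above is a Series with a hierarchical index. One way to read it more easily is to pivot the inner level into columns with unstack. The sketch assumes a hard-coded copy of the values shown in this notebook; means and table are illustrative names.

```python
import pandas as pd

# Fixed copy of the toy data (values hard-coded for reproducibility).
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [-0.238995, 0.440934, 0.687318, -0.933768, 0.012484]})

means = df['data1'].groupby([df['key1'], df['key2']]).mean()

# unstack moves the innermost index level (key2) into the columns,
# giving a key1-by-key2 table of means.
table = means.unstack()
print(table)
```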

Let's group the entire DataFrame on a single key. The result is a DataFrameGroupBy object:

k1g = df.groupby('key1')
k1g

<pandas.core.groupby.DataFrameGroupBy object at 0x113b66f28>

k1g.groups

{'a': Int64Index([0, 1, 4], dtype='int64'),
 'b': Int64Index([2, 3], dtype='int64')}

k1g.mean()

But let's look at what's inside of k1g:

for n, g in k1g:
    print('name:', n)
    display(g)

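name: a

      data1     data2 key1 key2
0 -0.238995 -0.579480    a  one
1  0.440934  0.000078    a  two
4  0.012484 -1.788194    a  one

name: b

      data1     data2 key1 key2
2  0.687318 -1.390271    b  one
3 -0.933768  0.187059    b  two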

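Where did column key2 go in the mean above? It's a nuisance column: a numerical mean makes no sense for strings, so older pandas versions drop it from the result automatically. Since pandas 2.0 such columns are no longer dropped silently, and k1g.mean() raises an error unless you pass numeric_only=True or select the numeric columns first.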
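To make the handling of non-numeric columns explicit instead of relying on automatic dropping, either of the two approaches below should work. This is a sketch on a hard-coded copy of the values shown in this notebook; means and means_selected are illustrative names.

```python
import pandas as pd

# Fixed copy of the toy data (values hard-coded for reproducibility).
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [-0.238995, 0.440934, 0.687318, -0.933768, 0.012484],
                   'data2': [-0.579480, 0.000078, -1.390271, 0.187059, -1.788194]})
k1g = df.groupby('key1')

# Option 1: ask pandas to skip non-numeric columns explicitly.
means = k1g.mean(numeric_only=True)

# Option 2: select the numeric columns before aggregating.
means_selected = k1g[['data1', 'data2']].mean()

print(means)
print(means_selected)
```

Selecting columns up front also avoids computing aggregates that are not needed.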

Grouping over a different dimension

Above, we've been grouping data along the rows, using column keys as our selectors. But we can also group along the columns; for example, by data type:

df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
    print(dtype)
    display(group)

float64

      data1     data2
0 -0.238995 -0.579480
1  0.440934  0.000078
2  0.687318 -1.390271
3 -0.933768  0.187059
4  0.012484 -1.788194

object

  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
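The axis=1 argument to DataFrame.groupby has been deprecated since pandas 2.1, so the call above may warn or fail on recent versions. A version-independent sketch of the same idea is to collect the column names per dtype and slice the frame directly. The frame is a hard-coded copy of the values shown in this notebook, and columns_by_dtype is an illustrative name. The exact dtype name reported for the string columns may differ between pandas versions.

```python
import pandas as pd

# Fixed copy of the toy data (values hard-coded for reproducibility).
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [-0.238995, 0.440934, 0.687318, -0.933768, 0.012484],
                   'data2': [-0.579480, 0.000078, -1.390271, 0.187059, -1.788194]})

# Collect column names under the string name of their dtype.
columns_by_dtype = {}
for column, dtype in df.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

# Slice the frame once per dtype, mirroring the (dtype, group) pairs above.
for dtype_name, columns in columns_by_dtype.items():
    print(dtype_name)
    print(df[columns])
```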