import numpy as np
import pandas as pd
Let's make a toy DataFrame (example taken from Wes McKinney's Python for Data Analysis):
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
Let's group the data1 column by the key1 column. A call to groupby does that, but what is the object that results?
grouped = df['data1'].groupby(df['key1'])
grouped
As we see, it's not simply a new DataFrame. Instead, it's an object that consists of groups:
grouped.groups
The grouped object is capable of making computations across all these groups:
grouped.mean()
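To make that computation concrete, here is a minimal sketch that reproduces grouped.mean() by hand with boolean masks. It rebuilds the toy frame with a seeded generator so that it runs on its own; the seed and the name manual_means are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': rng.standard_normal(5),
                   'data2': rng.standard_normal(5)})

grouped = df['data1'].groupby(df['key1'])
means = grouped.mean()

# Split: pick out the rows of each key by hand. Apply: take the mean.
manual_means = {key: df.loc[df['key1'] == key, 'data1'].mean()
                for key in df['key1'].unique()}

# Combine: print the hand-built value next to the groupby result.
for key, value in manual_means.items():
    print(key, value, means[key])
```

The two columns printed for each key should agree, which is the split-apply-combine pattern that groupby automates.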
But it can be informative to look at what's inside. We can iterate over a groupby object; each iteration yields a (name, group) pair, where group is either a Series or a DataFrame, depending on whether the groupby object is a SeriesGroupBy (as above) or a DataFrameGroupBy (see below):
from IPython.display import display # like print, but for complex objects
for name, group in grouped:
    print('Name:', name)
    display(group)
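Because iteration yields plain (name, group) pairs, the groups can also be collected into a dictionary or fetched one at a time with get_group. The sketch below assumes a seeded copy of the toy frame; the names series_pieces and frame_pieces are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': rng.standard_normal(5),
                   'data2': rng.standard_normal(5)})

# A SeriesGroupBy yields Series pieces...
series_pieces = {name: group
                 for name, group in df['data1'].groupby(df['key1'])}
print(type(series_pieces['a']).__name__)

# ...while a DataFrameGroupBy yields DataFrame pieces.
frame_pieces = {name: group for name, group in df.groupby('key1')}
print(type(frame_pieces['a']).__name__)

# get_group fetches a single group without looping over all of them.
only_a = df['data1'].groupby(df['key1']).get_group('a')
print(only_a.index.tolist())
```

Each piece keeps the row labels of the original frame, so it is easy to trace a group back to the rows it came from.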
g2 = df['data1'].groupby([df['key1'], df['key2']])
g2.groups
df
g2.mean()
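Grouping on two keys produces a result indexed by both of them. As a small sketch (again on a seeded copy of the toy frame, with the hypothetical names means and table), the two-level result can be pivoted into a key1 by key2 table with unstack:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': rng.standard_normal(5),
                   'data2': rng.standard_normal(5)})

g2 = df['data1'].groupby([df['key1'], df['key2']])
means = g2.mean()

# The result is a Series whose index has one level per grouping key.
print(list(means.index.names))

# unstack moves the inner level (key2) into the columns.
table = means.unstack()
print(table)
```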
Let's group the entire DataFrame on a single key. This produces a DataFrameGroupBy object:
k1g = df.groupby('key1')
k1g
k1g.groups
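k1g.mean(numeric_only=True)
But let's look at what's inside of k1g:
for n, g in k1g:
    print('name:', n)
    display(g)
Where did column key2 go in the mean above? It's a nuisance column: it holds strings, so a numerical mean makes no sense for it. Older pandas versions dropped such columns automatically. Since pandas 2.0 the default is numeric_only=False, so the mean raises a TypeError unless we pass numeric_only=True, as above, or select the numeric columns first.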
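One way to avoid depending on how a given pandas version treats nuisance columns is to select the columns to aggregate explicitly. The following is a sketch on a seeded copy of the toy frame; the names selected and flagged are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': rng.standard_normal(5),
                   'data2': rng.standard_normal(5)})

# Option 1: pick the numeric columns before aggregating.
selected = df.groupby('key1')[['data1', 'data2']].mean()

# Option 2: ask mean() to skip non-numeric columns such as key2.
flagged = df.groupby('key1').mean(numeric_only=True)

print(selected)
print(flagged)
```

Both calls should give the same table; the explicit selection has the advantage of documenting which columns the mean is meant to cover.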
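Above, we've been grouping data along the rows, using column keys as our selectors. But we can also group along the columns. For example, we can group by data type. Note that the axis=1 argument of groupby is deprecated as of pandas 2.1, so the call below may warn or fail on newer versions: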
df.dtypes
grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
    print(dtype)
    display(group)
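The same split by data type can be sketched without relying on the axis argument of groupby, by collecting column names per dtype and then selecting them. This is only one possible workaround, not the approach from the book; columns_by_dtype and subframes are hypothetical names, and the frame is a seeded copy of the toy example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': rng.standard_normal(5),
                   'data2': rng.standard_normal(5)})

# Collect the column names that share each dtype.
columns_by_dtype = {}
for column, dtype in df.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

# Build one sub-frame per dtype by selecting those columns.
subframes = {name: df[columns]
             for name, columns in columns_by_dtype.items()}

for name, frame in subframes.items():
    print(name)
    print(frame)
```

The name printed for the string columns depends on the pandas version, so the sketch does not rely on it.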