Mapping from datascience to Pandas¶

Welcome! This notebook is an unofficial resource created by the Data Science division.

It serves as an introduction to working with Python's widely used Pandas library for students who have taken Data 8. The functions introduced will be analogous to those in Berkeley's datascience module, with examples provided for each.

We will cover the following topics in this notebook:

  1. Basics of Pandas
    • Importing and Loading Packages

  2. [Dataframes: Working with Tabular Data](#dataframes)
    • [Creating a Dataframe](#creating)
    • [Accessing Values in a Dataframe](#accessing)
    • [Manipulating Data](#manipulating)

  3. [Visualizing Data](#visualizing)
    • [Histograms](#histograms)
    • [Line Plots](#line)
    • [Scatter Plots](#scatter)
    • [Bar Plots](#bar)

1. Basics ¶

This notebook assumes familiarity with Python concepts, syntax, and data structures at the level of Data 8. For a brief refresher on some Python concepts, refer to this Python Basics Guide on GitHub.

Python has a great ecosystem of data-centric packages, which makes it excellent for data analysis. Pandas is one of those packages: it makes importing and analyzing data much easier. Pandas builds on packages like NumPy and matplotlib to give us a single, convenient place to do most of our data analysis and visualization work.

1.1 Importing and Loading Packages ¶

It is useful to import certain packages into our workspace for analysis and data visualization. But first, we may need to install these packages if they are not already present. We do this via the command line as follows:

In [1]:
# run this cell to install packages
!pip install datascience
!pip install pandas
!pip install numpy
Collecting datascience
  Downloading datascience-0.17.6-py3-none-any.whl (732 kB)
Installing collected packages: datascience
Successfully installed datascience-0.17.6
Requirement already satisfied: pandas in /srv/conda/envs/notebook/lib/python3.9/site-packages (1.4.3)
Requirement already satisfied: numpy in /srv/conda/envs/notebook/lib/python3.9/site-packages (1.23.2)

Once we have installed the required packages, we do not need to reinstall them when we restart or reopen a Jupyter notebook; we can simply import them using the import keyword. Since we import Pandas as pd, we prefix all of its functions with pd, just as we prefix all NumPy functions with np (such as np.append()).

In [2]:
# run this cell to import the following packages
from datascience import * # import the datascience package
import pandas as pd # import the pandas library. pd is a common shorthand for pandas
import numpy as np # import numpy for working with numbers

2. Dataframes: Working with Tabular Data ¶

In Python's datascience module, we used Table to build our tables and used commands such as select(), where(), group(), column(), etc. In this section, we will go over some basic commands for working with tabular data in Pandas.

2.1 Creating a Dataframe ¶

Pandas introduces a data structure called a dataframe that represents data as a table with columns and rows.

In Python's datascience module that is used in Data 8, this is how we created tables from scratch by extending an empty table:

In [3]:
t = Table().with_columns([
     'letter', ['a', 'b', 'c', 'z'],
     'count',  [  9,   3,   3,   1],
     'points', [  1,   2,   2,  10],
 ])
t
Out[3]:
letter count points
a 9 1
b 3 2
c 3 2
z 1 10

In Pandas, we can use pd.DataFrame() to initialize a dataframe from a dictionary or a list-like object. Refer to the documentation for more information.

In [4]:
# example: creating a dataframe from a dictionary
df_from_dict = pd.DataFrame({ 'letter' : ['a', 'b', 'c', 'z'],
                      'count' : [  9,   3,   3,   1],
                      'points' : [  1,   2,   2,  10]
                      })
df_from_dict
Out[4]:
letter count points
0 a 9 1
1 b 3 2
2 c 3 2
3 z 1 10

More often, we will need to create a dataframe by importing data from a .csv file. In datascience, this is how we read data from a csv:

In [5]:
datascience_baby = Table.read_table('data/baby.csv')
datascience_baby
Out[5]:
Birth Weight Gestational Days Maternal Age Maternal Height Maternal Pregnancy Weight Maternal Smoker
120 284 27 62 100 False
113 282 33 64 135 False
128 279 28 64 115 True
108 282 23 67 125 True
136 286 25 62 93 False
138 244 33 62 178 False
132 245 23 65 140 False
120 289 25 62 125 False
143 299 30 66 136 True
140 351 27 68 120 False

... (1164 rows omitted)

In Pandas, we use pd.read_csv() to read data from a csv file. Sometimes, depending on the data file, we may need to specify parameters such as sep, header, or encoding as well. For a full list of parameters, refer to this guide.

In [6]:
# reading baby.csv (located in current working directory)
baby = pd.read_csv('data/baby.csv')
baby.head() # display first few rows of dataframe
Out[6]:
Birth Weight Gestational Days Maternal Age Maternal Height Maternal Pregnancy Weight Maternal Smoker
0 120 284 27 62 100 False
1 113 282 33 64 135 False
2 128 279 28 64 115 True
3 108 282 23 67 125 True
4 136 286 25 62 93 False
In [7]:
# view summary of data
baby.describe()
Out[7]:
Birth Weight Gestational Days Maternal Age Maternal Height Maternal Pregnancy Weight
count 1174.000000 1174.000000 1174.000000 1174.000000 1174.000000
mean 119.462521 279.101363 27.228279 64.049404 128.478705
std 18.328671 16.010305 5.817839 2.526102 20.734282
min 55.000000 148.000000 15.000000 53.000000 87.000000
25% 108.000000 272.000000 23.000000 62.000000 114.250000
50% 120.000000 280.000000 26.000000 64.000000 125.000000
75% 131.000000 288.000000 31.000000 66.000000 139.000000
max 176.000000 353.000000 45.000000 72.000000 250.000000
In [8]:
# example: loading csv from URL
sat = pd.read_csv('https://raw.githubusercontent.com/data-8/materials-sp18/master/lec/sat2014.csv')
sat.head()
Out[8]:
State Participation Rate Critical Reading Math Writing Combined
0 North Dakota 2.3 612 620 584 1816
1 Illinois 4.6 599 616 587 1802
2 Iowa 3.1 605 611 578 1794
3 South Dakota 2.9 604 609 579 1792
4 Minnesota 5.9 598 610 578 1786
In [9]:
# view information about dataframe
print(sat.shape) # view dimensions (rows, cols)
print(sat.columns.values) # view column names
(51, 6)
['State' 'Participation Rate' 'Critical Reading' 'Math' 'Writing'
 'Combined']
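For instance, the sep parameter comes into play when a file is not comma-delimited. As an illustrative sketch (using an in-memory string in place of an actual file), reading semicolon-separated data might look like this:

```python
import io
import pandas as pd

# a small semicolon-separated "file" held in memory for illustration
raw = "letter;count\na;9\nb;3\n"

# sep tells read_csv the delimiter; header=0 says the first line holds column names
df = pd.read_csv(io.StringIO(raw), sep=";", header=0)
print(df)
```

The same call works on a file path; io.StringIO simply stands in for a file here so the example is self-contained.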

2.2 Accessing Values in Dataframe ¶

In datascience, we can use column() to access values in a particular column as follows:

In [10]:
# access column 'letter'. returns array
t.column('letter')
Out[10]:
array(['a', 'b', 'c', 'z'],
      dtype='<U1')

In Pandas, each column is a Series object. We can access a column by using square bracket notation.

In [11]:
# returns Series object
sat['State']
Out[11]:
0             North Dakota
1                 Illinois
2                     Iowa
3             South Dakota
4                Minnesota
5                 Michigan
6                Wisconsin
7                 Missouri
8                  Wyoming
9                   Kansas
10                Kentucky
11                Nebraska
12                Colorado
13             Mississippi
14               Tennessee
15                Arkansas
16                Oklahoma
17                    Utah
18               Louisiana
19                    Ohio
20                 Montana
21                 Alabama
22              New Mexico
23           New Hampshire
24           Massachusetts
25                 Vermont
26                 Arizona
27                  Oregon
28                Virginia
29              New Jersey
30             Connecticut
31           West Virginia
32              Washington
33              California
34                  Alaska
35          North Carolina
36            Pennsylvania
37            Rhode Island
38                 Indiana
39                Maryland
40                New York
41                  Hawaii
42                  Nevada
43                 Florida
44                 Georgia
45          South Carolina
46                   Texas
47                   Maine
48                   Idaho
49                Delaware
50    District of Columbia
Name: State, dtype: object

If we want a NumPy array of column values, we can access the values attribute of a Series object (in recent versions of Pandas, the to_numpy() method is the recommended equivalent):

In [12]:
sat['State'].values
Out[12]:
array(['North Dakota', 'Illinois', 'Iowa', 'South Dakota', 'Minnesota',
       'Michigan', 'Wisconsin', 'Missouri', 'Wyoming', 'Kansas',
       'Kentucky', 'Nebraska', 'Colorado', 'Mississippi', 'Tennessee',
       'Arkansas', 'Oklahoma', 'Utah', 'Louisiana', 'Ohio', 'Montana',
       'Alabama', 'New Mexico', 'New Hampshire', 'Massachusetts',
       'Vermont', 'Arizona', 'Oregon', 'Virginia', 'New Jersey',
       'Connecticut', 'West Virginia', 'Washington', 'California',
       'Alaska', 'North Carolina', 'Pennsylvania', 'Rhode Island',
       'Indiana', 'Maryland', 'New York', 'Hawaii', 'Nevada', 'Florida',
       'Georgia', 'South Carolina', 'Texas', 'Maine', 'Idaho', 'Delaware',
       'District of Columbia'], dtype=object)

In datascience, we used take() to access rows of a Table:

In [13]:
# selecting first two rows using Python's slicing notation
t.take[0:2]
Out[13]:
letter count points
a 9 1
b 3 2

In Pandas, we can access rows and columns by their position using the iloc indexer. We specify the rows and columns we want with the syntax df.iloc[<rows>, <columns>]. For more information on indexing, refer to this guide.

In [14]:
# selecting first two rows using iloc
baby.iloc[0:2, :] 
Out[14]:
Birth Weight Gestational Days Maternal Age Maternal Height Maternal Pregnancy Weight Maternal Smoker
0 120 284 27 62 100 False
1 113 282 33 64 135 False
In [15]:
# specifying row indices
baby.iloc[[1, 4, 6], :]
Out[15]:
Birth Weight Gestational Days Maternal Age Maternal Height Maternal Pregnancy Weight Maternal Smoker
1 113 282 33 64 135 False
4 136 286 25 62 93 False
6 132 245 23 65 140 False

We can also access a specific value in the dataframe by passing in the row and column indices:

In [16]:
# get value in second row, third column
baby.iloc[1, 2]
Out[16]:
33
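Pandas also offers the loc indexer, which selects by label rather than by position; unlike iloc, its slices include the endpoint. A small sketch on a toy dataframe (not one of the datasets above):

```python
import pandas as pd

df = pd.DataFrame({"letter": ["a", "b", "c", "z"],
                   "count":  [9, 3, 3, 1]})

# .loc selects by row/column *labels*; here the row labels are the default integers
first_two = df.loc[0:1, ["letter", "count"]]  # note: loc slices are inclusive, so rows 0 and 1

# one row label and one column label returns a scalar
single = df.loc[2, "letter"]
```

With the default integer index, loc and iloc look similar, but they diverge as soon as the index holds meaningful labels (e.g. after set_index()).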

2.3 Manipulating Data ¶

Adding Columns

Adding a new column in datascience is done with the with_column() function, which returns a new table and leaves the original unchanged:

In [17]:
t.with_column('vowel', ['yes', 'no', 'no', 'no'])
t # unchanged, since with_column returns a new table
Out[17]:
letter count points
a 9 1
b 3 2
c 3 2
z 1 10

In Pandas, we can use bracket notation and assign a list to add a new column to the dataframe as follows:

In [18]:
# adding a new column
df_from_dict['newcol'] = [5, 6, 7, 8]
df_from_dict
Out[18]:
letter count points newcol
0 a 9 1 5
1 b 3 2 6
2 c 3 2 7
3 z 1 10 8

We can also create a new column by performing an operation on an existing column:

In [19]:
# adding count * 2 to the dataframe
df_from_dict['doublecount'] = df_from_dict['count'] * 2
df_from_dict
Out[19]:
letter count points newcol doublecount
0 a 9 1 5 18
1 b 3 2 6 6
2 c 3 2 7 6
3 z 1 10 8 2

Selecting Columns

In datascience, we used select() to subset the dataframe by selecting columns:

In [20]:
t.select(['letter', 'points'])
Out[20]:
letter points
a 1
b 2
c 2
z 10

In Pandas, we use double bracket notation to select columns. This returns a dataframe, unlike the single bracket notation, which returns a Series.

In [21]:
# double bracket notation for new dataframe
df_from_dict[['count', 'doublecount']]
Out[21]:
count doublecount
0 9 18
1 3 6
2 3 6
3 1 2

Filtering Rows Conditionally

In datascience, we used where() to select rows according to a given condition:

In [22]:
t.where('points', 2) # rows where points == 2
Out[22]:
letter count points
b 3 2
c 3 2
In [23]:
t.where(t['count'] < 8) # rows where count < 8
Out[23]:
letter count points
b 3 2
c 3 2
z 1 10

In Pandas, we can use the bracket notation to subset the dataframe based on a condition. We first specify a condition and then subset using the bracket notation:

In [24]:
# array of booleans
baby['Maternal Smoker'] == True
Out[24]:
0       False
1       False
2        True
3        True
4       False
        ...  
1169    False
1170    False
1171     True
1172    False
1173    False
Name: Maternal Smoker, Length: 1174, dtype: bool
In [25]:
# filter rows by condition Maternal.Smoker == True
baby[baby['Maternal Smoker'] == True]
Out[25]:
Birth Weight Gestational Days Maternal Age Maternal Height Maternal Pregnancy Weight Maternal Smoker
2 128 279 28 64 115 True
3 108 282 23 67 125 True
8 143 299 30 66 136 True
10 144 282 32 64 124 True
11 141 279 23 63 128 True
... ... ... ... ... ... ...
1162 143 281 28 65 135 True
1163 113 287 29 70 145 True
1164 109 244 21 63 102 True
1165 103 278 30 60 87 True
1171 130 291 30 65 150 True

459 rows × 6 columns

In [26]:
# filtering with multiple conditions
df_from_dict[(df_from_dict['count'] < 8) & (df_from_dict['points'] > 5)]
Out[26]:
letter count points newcol doublecount
3 z 1 10 8 2
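As an aside, the same multi-condition filter can also be written with the query() method, which takes the condition as a string. A small illustrative sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"letter": ["a", "b", "c", "z"],
                   "count":  [9, 3, 3, 1],
                   "points": [1, 2, 2, 10]})

# query() expresses the boolean-mask filter as a string; column names
# are referenced directly, and "and"/"or" replace & and |
filtered = df.query("count < 8 and points > 5")
```

Some find query() easier to read for long conditions; the bracket notation shown above is the more fundamental idiom.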

Renaming Columns

In datascience, we used relabeled() to rename columns:

In [27]:
# rename 'points' to 'other name'
t.relabeled('points', 'other name')
Out[27]:
letter count other name
a 9 1
b 3 2
c 3 2
z 1 10

Pandas uses rename(), whose columns parameter takes a dictionary mapping the old column names to their replacements. (The index = str argument below is optional; it simply converts the row labels to strings.)

In [28]:
# rename 'points' to 'other name'
df_from_dict.rename(index = str, columns = {"points" : "other name"})
Out[28]:
letter count other name newcol doublecount
0 a 9 1 5 18
1 b 3 2 6 6
2 c 3 2 7 6
3 z 1 10 8 2
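Passing only the columns dictionary is enough, and rename() returns a new dataframe rather than modifying the original. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"letter": ["a", "b"], "points": [1, 2]})

# columns maps {old name: new name}; df itself is left unchanged
renamed = df.rename(columns={"points": "other name"})
```

To rename in place instead, rename() also accepts an inplace=True argument.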

Sorting Dataframe by Column

In datascience we used sort() to sort a Table according to the values in a column:

In [29]:
# sort by count
t.sort('count')
Out[29]:
letter count points
z 1 10
b 3 2
c 3 2
a 9 1

In Pandas, we use sort_values() to sort by column. The by parameter specifies the column we want to sort by, and the optional parameter ascending = False sorts in descending order:

In [30]:
# sort by count, descending
df_from_dict.sort_values(by = ['count'], ascending = False)
Out[30]:
letter count points newcol doublecount
0 a 9 1 5 18
1 b 3 2 6 6
2 c 3 2 7 6
3 z 1 10 8 2

Grouping and Aggregating

In datascience, we used group() and the collect argument to group a Table by a column and aggregate values in another column:

In [31]:
# group by count and aggregate by sum
t.select(['count', 'points']).group('count', collect=sum)
Out[31]:
count points sum
1 10
3 4
9 1

In Pandas, we use groupby() to group the dataframe. This function returns a groupby object, on which we can then call an aggregation function to return a dataframe with aggregated values for the other columns. For more information, refer to the documentation.

In [32]:
# selecting two columns for brevity
df_subset = df_from_dict[['count', 'points']]
df_subset
Out[32]:
count points
0 9 1
1 3 2
2 3 2
3 1 10
In [33]:
count_sums_df = df_subset.groupby(['count']).sum()
count_sums_df
Out[33]:
points
count
1 10
3 4
9 1
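A groupby object can also aggregate with several functions at once via agg(). A small sketch on toy data mirroring the example above:

```python
import pandas as pd

df = pd.DataFrame({"count": [9, 3, 3, 1], "points": [1, 2, 2, 10]})

# agg() applies each named aggregation to the 'points' values within each group,
# producing one column per function
summary = df.groupby("count")["points"].agg(["sum", "mean"])
```

This is handy when one grouped summary table should show, say, both totals and averages side by side.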

Pivot Tables

In datascience, we used the pivot() function to build contingency tables:

In [34]:
# creating new Table
cones_tbl = Table().with_columns(
    'Flavor', make_array('strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate', 'bubblegum'),
    'Color', make_array('pink', 'light brown', 'dark brown', 'pink', 'dark brown', 'pink'),
    'Price', make_array(3.55, 4.75, 5.25, 5.25, 5.25, 4.75)
)

cones_tbl
Out[34]:
Flavor Color Price
strawberry pink 3.55
chocolate light brown 4.75
chocolate dark brown 5.25
strawberry pink 5.25
chocolate dark brown 5.25
bubblegum pink 4.75
In [35]:
# pivoting on color and flavor
cones_tbl.pivot("Flavor", "Color")
Out[35]:
Color bubblegum chocolate strawberry
dark brown 0 2 0
light brown 0 1 0
pink 1 0 2

We can also pass in the values parameter to specify the values in the table and collect to specify the aggregation function.

In [36]:
# setting parameters values and collect
cones_tbl.pivot("Flavor", "Color", values = "Price", collect = np.sum)
Out[36]:
Color bubblegum chocolate strawberry
dark brown 0 10.5 0
light brown 0 4.75 0
pink 4.75 0 8.8

In Pandas, we use pd.pivot_table() to create a contingency table. The columns argument is similar to the first argument in datascience's pivot function and sets the column names of the pivot table. The index argument is similar to the second argument in datascience's pivot function and sets the first column of the pivot table, i.e. the keys to group on. For more information, refer to the documentation.

In [37]:
# creating new dataframe
cones_df = pd.DataFrame({"Flavor" : ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate', 'bubblegum'],
                         "Color" : ['pink', 'light brown', 'dark brown', 'pink', 'dark brown', 'pink'],
                         "Price" : [3.55, 4.75, 5.25, 5.25, 5.25, 4.75]})
cones_df
Out[37]:
Flavor Color Price
0 strawberry pink 3.55
1 chocolate light brown 4.75
2 chocolate dark brown 5.25
3 strawberry pink 5.25
4 chocolate dark brown 5.25
5 bubblegum pink 4.75
In [38]:
# creating the pivot table
pd.pivot_table(cones_df, columns = ["Flavor"], index = ["Color"])
Out[38]:
Price
Flavor bubblegum chocolate strawberry
Color
dark brown NaN 5.25 NaN
light brown NaN 4.75 NaN
pink 4.75 NaN 4.4

If there is no data in a group, Pandas outputs NaN values.

We can also specify parameters like values (equivalent to values in datascience's pivot) and aggfunc (equivalent to collect in datascience's pivot):

In [39]:
# additional arguments
pd.pivot_table(cones_df, columns = ["Flavor"], index = ["Color"], values = "Price", aggfunc=np.sum)
Out[39]:
Flavor bubblegum chocolate strawberry
Color
dark brown NaN 10.50 NaN
light brown NaN 4.75 NaN
pink 4.75 NaN 8.8
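If we prefer zeros over NaN, as in datascience's output, pivot_table() also accepts a fill_value parameter. A sketch reproducing the cones example with that option:

```python
import pandas as pd

cones = pd.DataFrame({"Flavor": ["strawberry", "chocolate", "chocolate",
                                 "strawberry", "chocolate", "bubblegum"],
                      "Color": ["pink", "light brown", "dark brown",
                                "pink", "dark brown", "pink"],
                      "Price": [3.55, 4.75, 5.25, 5.25, 5.25, 4.75]})

# fill_value replaces the NaNs produced by empty groups,
# mirroring the zeros in datascience's pivot output
pivot = pd.pivot_table(cones, columns="Flavor", index="Color",
                       values="Price", aggfunc="sum", fill_value=0)
```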

Joining/Merging

In datascience, we used join() to join two tables based on shared values in columns. We specify the column name in the first table to match on, the name of the second table and the column name in the second table to match on.

In [40]:
# creating new table
ratings_tbl = Table().with_columns(
    'Kind', make_array('strawberry', 'chocolate', 'vanilla'),
    'Stars', make_array(2.5, 3.5, 4)
)
ratings_tbl
Out[40]:
Kind Stars
strawberry 2.5
chocolate 3.5
vanilla 4
In [41]:
# joining cones and ratings
cones_tbl.join("Flavor", ratings_tbl, "Kind")
Out[41]:
Flavor Color Price Stars
chocolate light brown 4.75 3.5
chocolate dark brown 5.25 3.5
chocolate dark brown 5.25 3.5
strawberry pink 3.55 2.5
strawberry pink 5.25 2.5

In Pandas, we can use the merge() method to join two dataframes together. The first argument is the dataframe to join with. The parameters left_on and right_on specify the columns to match on in the left and right tables respectively. There are further parameters, such as how, which specifies the kind of join to perform (inner (default), outer, left, or right). For more information, refer to this Kaggle Tutorial.

In [42]:
# creating new ratings df
ratings_df = pd.DataFrame({"Kind" : ['strawberry', 'chocolate', 'vanilla'],
                           "Stars" : [2.5, 3.5, 4]})
ratings_df
Out[42]:
Kind Stars
0 strawberry 2.5
1 chocolate 3.5
2 vanilla 4.0
In [43]:
# merging cones and ratings
cones_df.merge(ratings_df, left_on = "Flavor", right_on = "Kind")
Out[43]:
Flavor Color Price Kind Stars
0 strawberry pink 3.55 strawberry 2.5
1 strawberry pink 5.25 strawberry 2.5
2 chocolate light brown 4.75 chocolate 3.5
3 chocolate dark brown 5.25 chocolate 3.5
4 chocolate dark brown 5.25 chocolate 3.5
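To keep unmatched rows instead of dropping them, we can set how. A small sketch (on made-up toy data) using a left join, where flavors without a rating get NaN for Stars:

```python
import pandas as pd

cones = pd.DataFrame({"Flavor": ["strawberry", "bubblegum"],
                      "Price":  [3.55, 4.75]})
ratings = pd.DataFrame({"Kind":  ["strawberry", "chocolate"],
                        "Stars": [2.5, 3.5]})

# how="left" keeps every row of cones; rows with no match in ratings
# get NaN in the columns coming from ratings
left = cones.merge(ratings, left_on="Flavor", right_on="Kind", how="left")
```

With the default how="inner", the bubblegum row would be dropped entirely, which matches the behavior of datascience's join().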

3. Visualizing Data ¶

In datascience, we learned to plot data using histograms, line plots, scatter plots, and bar plots. The corresponding functions were hist(), plot(), scatter(), and barh(). Plotting methods in Pandas are nearly identical to those in datascience, since both build on the matplotlib library.

In this section, we will go through examples of such plots in Pandas.

3.1 Histograms

In datascience, we used hist() to create a histogram. In this example, we will be using data from baby.csv. Recall that the baby data set contains data on a random sample of 1,174 mothers and their newborn babies. The column Birth Weight contains the birth weight of the baby, in ounces; Gestational Days is the number of gestational days, that is, the number of days the baby was in the womb. There is also data on maternal age, maternal height, maternal pregnancy weight, and whether or not the mother was a smoker.

In [45]:
# importing matplotlib for plotting
import matplotlib
%matplotlib inline
In [46]:
# reading in the data
datascience_baby = Table.read_table('data/baby.csv')
datascience_baby
Out[46]:
Birth Weight Gestational Days Maternal Age Maternal Height Maternal Pregnancy Weight Maternal Smoker
120 284 27 62 100 False
113 282 33 64 135 False
128 279 28 64 115 True
108 282 23 67 125 True
136 286 25 62 93 False
138 244 33 62 178 False
132 245 23 65 140 False
120 289 25 62 125 False
143 299 30 66 136 True
140 351 27 68 120 False

... (1164 rows omitted)

In [47]:
# creating a histogram
datascience_baby.hist('Birth Weight')

In Pandas, we use hist() to create histograms, just like in datascience. Refer to the documentation for a full list of parameters.

In [48]:
# creating a histogram
baby.hist('Birth Weight')
Out[48]:
array([[<AxesSubplot:title={'center':'Birth Weight'}>]], dtype=object)

3.2 Line Plots

In datascience, we used plot() to create a line plot of numerical values. In this example, we will use census data and plot variables such as AGE in a line plot.

In [49]:
# line plot in datascience
census_tbl = Table.read_table("https://raw.githubusercontent.com/data-8/materials-x18/master/lec/x18/1/census.csv").select(['SEX', 'AGE', 'POPESTIMATE2014'])
children_tbl = census_tbl.where('SEX', are.equal_to(0)).where('AGE', are.below(19)).drop('SEX')
children_tbl.plot('AGE')

In Pandas, we can use plot.line() to create line plots. For a full list of parameters, refer to the documentation

In [50]:
#Pandas
census_df = pd.read_csv("https://raw.githubusercontent.com/data-8/materials-x18/master/lec/x18/1/census.csv")[["SEX", "AGE", "POPESTIMATE2014"]]
children_df = census_df[(census_df.SEX == 0) & (census_df.AGE < 19)].drop("SEX", axis=1)
children_df.plot.line(x="AGE", y="POPESTIMATE2014")
Out[50]:
<AxesSubplot:xlabel='AGE'>

3.3 Scatter Plots

In datascience, we used scatter() to create a scatter plot of two numerical columns

In [51]:
football_tbl = Table.read_table('https://raw.githubusercontent.com/data-8/materials-sp18/master/lec/deflategate.csv')
football_tbl
Out[51]:
Team Blakeman Prioleau
Patriots 11.5 11.8
Patriots 10.85 11.2
Patriots 11.15 11.5
Patriots 10.7 11
Patriots 11.1 11.45
Patriots 11.6 11.95
Patriots 11.85 12.3
Patriots 11.1 11.55
Patriots 10.95 11.35
Patriots 10.5 10.9

... (5 rows omitted)

In [52]:
football_tbl.scatter('Blakeman', 'Prioleau')

In Pandas, we use plot.scatter() to create a scatter plot. For a full list of parameters, refer to the documentation

In [53]:
football_df = pd.read_csv('https://raw.githubusercontent.com/data-8/materials-sp18/master/lec/deflategate.csv')
football_df.plot.scatter(x="Blakeman", y="Prioleau")
Out[53]:
<AxesSubplot:xlabel='Blakeman', ylabel='Prioleau'>

3.4 Bar Plots

In datascience, we used barh() to create a horizontal bar plot

In [54]:
t.barh("letter", "points")

In Pandas, we use plot.barh() to create a bar chart. For a full list of parameters, refer to the documentation

In [55]:
df_from_dict.plot.barh(x='letter', y='points')
Out[55]:
<AxesSubplot:ylabel='letter'>

Further Reading¶

Here is a list of useful Pandas resources:

  • Pandas Documentation
  • Dataquest Pandas Tutorial
  • Pandas Cookbook