Pandas, Part I, Demo

By Narges Norouzi

In this notebook we practice what we learned about Pandas in lecture. The dataset we use is the Diversity Index of US Counties dataset, obtained from kaggle (link) and then cleaned and slightly modified. It lists counties in the USA along with each county's diversity index. The diversity index is defined as $D = 1 - \sum \left(\frac{n}{N}\right)^2$, where $n$ is the number of people of a given race, $N$ is the total number of people of all races, and the sum runs over all races. $D$ is the probability that two randomly selected people are of different races (ecological entropy).
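
As a rough illustration of the formula, the sketch below computes $D$ for a single made-up county. The group names and counts are hypothetical and are not taken from the dataset, and `diversity_index` is a helper name introduced here only for the example.

```python
import pandas as pd

# Hypothetical population counts by race for one made-up county.
counts = pd.Series({"A": 500, "B": 300, "C": 200})


def diversity_index(counts):
    """Return D = 1 - sum((n / N) ** 2) for a Series of group counts."""
    shares = counts / counts.sum()  # n / N for each group
    return 1 - (shares ** 2).sum()


d = diversity_index(counts)
print(round(d, 2))  # 0.62, since 1 - (0.25 + 0.09 + 0.04) = 0.62
```

A county with only one group gets $D = 0$, and $D$ grows as the population is spread more evenly across more groups.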

In [56]:
import pandas as pd

Reading the data

In [57]:
df = pd.read_csv("data/diversityindex.csv")

Looking at the head and tail of the data

In [58]:
df.head()
Out[58]:
   Location                        State  County                      Diversity-Index
0  Aleutians West Census Area, AK  AK     Aleutians West Census Area  0.769346
1  Queens County, NY               NY     Queens County               0.742224
2  Maui County, HI                 HI     Maui County                 0.740757
3  Alameda County, CA              CA     Alameda County              0.740399
4  Aleutians East Borough, AK      AK     Aleutians East Borough      0.738867
In [59]:
df.tail()
Out[59]:
      Location              State  County            Diversity-Index
3138  Osage County, MO      MO     Osage County      0.037540
3139  Lincoln County, WV    WV     Lincoln County    0.035585
3140  Leslie County, KY     KY     Leslie County     0.035581
3141  Blaine County, NE     NE     Blaine County     0.023784
3142  Keya Paha County, NE  NE     Keya Paha County  0.021816

Using loc to slice the DataFrame

In [60]:
df.loc[3, "Diversity-Index"]
Out[60]:
0.740399
In [61]:
df.loc[len(df)//2 - 1 : len(df)//2 + 1, "Location"]
Out[61]:
1570     Cleveland County, AR
1571    Lauderdale County, AL
1572      Hamilton County, IN
Name: Location, dtype: object
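
The slice above returns three rows (1570, 1571, and 1572) because `loc` slices by label and includes the end label. A minimal sketch on a hypothetical toy frame (the names `toy`, `by_label`, and `by_position`, and all values, are made up for illustration) contrasts this with positional slicing:

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame(
    {"County": ["A", "B", "C", "D", "E"], "Score": [0.9, 0.7, 0.5, 0.3, 0.1]}
)

by_label = toy.loc[1:3, "County"]  # labels 1, 2, 3 (end label included)
by_position = toy.iloc[1:3, 0]     # positions 1, 2 (end position excluded)

print(list(by_label))     # ['B', 'C', 'D']
print(list(by_position))  # ['B', 'C']
```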
In [62]:
df.loc[[0, 10, 20, 50], "State":"Diversity-Index"]
Out[62]:
    State  County                      Diversity-Index
0   AK     Aleutians West Census Area  0.769346
10  NC     Robeson County              0.704067
20  CA     Contra Costa County         0.686497
50  CA     Sutter County               0.647059
In [63]:
df.loc[1, :]
Out[63]:
Location           Queens County, NY
State                             NY
County                 Queens County
Diversity-Index             0.742224
Name: 1, dtype: object
In [64]:
df.set_index("County", inplace=True)
df
Out[64]:
                            Location                        State  Diversity-Index
County
Aleutians West Census Area  Aleutians West Census Area, AK  AK     0.769346
Queens County               Queens County, NY               NY     0.742224
Maui County                 Maui County, HI                 HI     0.740757
Alameda County              Alameda County, CA              CA     0.740399
Aleutians East Borough      Aleutians East Borough, AK      AK     0.738867
...                         ...                             ...    ...
Osage County                Osage County, MO                MO     0.037540
Lincoln County              Lincoln County, WV              WV     0.035585
Leslie County               Leslie County, KY               KY     0.035581
Blaine County               Blaine County, NE               NE     0.023784
Keya Paha County            Keya Paha County, NE            NE     0.021816

3143 rows × 3 columns

In [65]:
df.loc["Los Angeles County", :]
Out[65]:
Location           Los Angeles County, CA
State                                  CA
Diversity-Index                  0.661865
Name: Los Angeles County, dtype: object
In [66]:
df.reset_index(inplace=True)
df
Out[66]:
      County                      Location                        State  Diversity-Index
0     Aleutians West Census Area  Aleutians West Census Area, AK  AK     0.769346
1     Queens County               Queens County, NY               NY     0.742224
2     Maui County                 Maui County, HI                 HI     0.740757
3     Alameda County              Alameda County, CA              CA     0.740399
4     Aleutians East Borough      Aleutians East Borough, AK      AK     0.738867
...   ...                         ...                             ...    ...
3138  Osage County                Osage County, MO                MO     0.037540
3139  Lincoln County              Lincoln County, WV              WV     0.035585
3140  Leslie County               Leslie County, KY               KY     0.035581
3141  Blaine County               Blaine County, NE               NE     0.023784
3142  Keya Paha County            Keya Paha County, NE            NE     0.021816

3143 rows × 4 columns
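
The cells above modify `df` in place, and after `reset_index` the County column comes back as the first column rather than in its original position. The sketch below shows the same round trip without `inplace=True`, on a hypothetical toy frame (all names and values are made up). It also shows one thing to watch for: when an index label is not unique, `loc` returns a DataFrame instead of a Series.

```python
import pandas as pd

# Hypothetical toy data; "A County" deliberately appears in two states.
toy = pd.DataFrame(
    {
        "Location": ["A County, XX", "B County, YY", "A County, ZZ"],
        "State": ["XX", "YY", "ZZ"],
        "County": ["A County", "B County", "A County"],
    }
)

indexed = toy.set_index("County")     # returns a new frame; toy is unchanged
unique_row = indexed.loc["B County"]  # unique label -> Series
dup_rows = indexed.loc["A County"]    # repeated label -> DataFrame with 2 rows
restored = indexed.reset_index()      # County returns as the first column

print(list(toy.columns))       # ['Location', 'State', 'County']
print(list(restored.columns))  # ['County', 'Location', 'State']
```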

Using iloc to slice the DataFrame

In [67]:
df.iloc[1, 0:1]
Out[67]:
County    Queens County
Name: 1, dtype: object
In [68]:
df.iloc[10:20, :]
Out[68]:
    County                    Location                      State  Diversity-Index
10  Robeson County            Robeson County, NC            NC     0.704067
11  Gwinnett County           Gwinnett County, GA           GA     0.702974
12  Yakutat City and Borough  Yakutat City and Borough, AK  AK     0.698748
13  Santa Clara County        Santa Clara County, CA        CA     0.694312
14  Kings County              Kings County, NY              NY     0.692349
15  San Mateo County          San Mateo County, CA          CA     0.691029
16  Manassas Park city        Manassas Park city, VA        VA     0.690899
17  Dallas County             Dallas County, TX             TX     0.690390
18  Montgomery County         Montgomery County, MD         MD     0.687803
19  Sacramento County         Sacramento County, CA         CA     0.687281
In [69]:
df.iloc[[13, 17], [0, 2, 3]]
Out[69]:
    County              State  Diversity-Index
13  Santa Clara County  CA     0.694312
17  Dallas County       TX     0.690390
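
Two details of `iloc` in the cells above are easy to miss: a column slice such as `0:1` returns a one-element Series rather than a plain value, and `iloc` counts positions, not index labels. The following sketch uses a hypothetical toy frame (names and values are made up) to show both:

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({"County": ["A", "B", "C"], "Score": [0.9, 0.7, 0.5]})

as_series = toy.iloc[1, 0:1]  # column slice -> one-element Series
as_scalar = toy.iloc[1, 0]    # single column position -> plain value "B"

# After sorting ascending by Score, the index labels run 2, 1, 0.
reordered = toy.sort_values("Score")
first_by_position = reordered.iloc[0, 0]     # first row in the new order: "C"
first_by_label = reordered.loc[0, "County"]  # row whose label is 0: "A"

print(type(as_series).__name__, as_scalar)
print(first_by_position, first_by_label)
```

In the demo DataFrame the labels and positions happen to coincide because the index is the default 0, 1, 2, ... range, so `loc` and `iloc` pick the same rows there.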

Using [] to slice the DataFrame

In [70]:
df[-5:]
Out[70]:
      County            Location              State  Diversity-Index
3138  Osage County      Osage County, MO      MO     0.037540
3139  Lincoln County    Lincoln County, WV    WV     0.035585
3140  Leslie County     Leslie County, KY     KY     0.035581
3141  Blaine County     Blaine County, NE     NE     0.023784
3142  Keya Paha County  Keya Paha County, NE  NE     0.021816
In [71]:
df["Diversity-Index"]
Out[71]:
0       0.769346
1       0.742224
2       0.740757
3       0.740399
4       0.738867
          ...   
3138    0.037540
3139    0.035585
3140    0.035581
3141    0.023784
3142    0.021816
Name: Diversity-Index, Length: 3143, dtype: float64
In [72]:
df[["County", "State", "Location"]]
Out[72]:
      County                      State  Location
0     Aleutians West Census Area  AK     Aleutians West Census Area, AK
1     Queens County               NY     Queens County, NY
2     Maui County                 HI     Maui County, HI
3     Alameda County              CA     Alameda County, CA
4     Aleutians East Borough      AK     Aleutians East Borough, AK
...   ...                         ...    ...
3138  Osage County                MO     Osage County, MO
3139  Lincoln County              WV     Lincoln County, WV
3140  Leslie County               KY     Leslie County, KY
3141  Blaine County               NE     Blaine County, NE
3142  Keya Paha County            NE     Keya Paha County, NE

3143 rows × 3 columns
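
The three cells above show that `[]` behaves differently depending on what it is given. A compact sketch on a hypothetical toy frame (names and values are made up) summarizes the three cases:

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame(
    {"County": ["A", "B", "C", "D"], "Score": [0.9, 0.7, 0.5, 0.3]}
)

last_rows = toy[-2:]            # slice -> rows selected by position (DataFrame)
one_column = toy["Score"]       # single label -> column as a Series
some_columns = toy[["County"]]  # list of labels -> DataFrame, even for one name

print(list(last_rows.index))        # [2, 3]
print(type(one_column).__name__)    # Series
print(type(some_columns).__name__)  # DataFrame
```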