By Narges Norouzi
In this notebook, we practice what we learned about pandas in the lecture. We use the Diversity Index of US Counties dataset, obtained from Kaggle (link) and then cleaned and slightly modified. The dataset lists counties in the USA together with each county's diversity index. The diversity index is defined as $D = 1 - \sum \left(\frac{n}{N}\right)^2$, where $n$ is the number of people of a given race, $N$ is the total number of people of all races, and the sum runs over all races. $D$ is the probability that two randomly selected people are of different races (ecological entropy).
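To make the formula concrete, here is a minimal sketch that evaluates $D$ from a list of group counts. The function name `diversity_index` and the counts are hypothetical, chosen only for illustration; they are not part of the dataset or of pandas.

```python
def diversity_index(counts):
    """Return D = 1 - sum((n / N) ** 2) for a list of group counts."""
    total = sum(counts)  # N: total number of people across all groups
    if total == 0:
        raise ValueError("counts must sum to a positive number")
    return 1 - sum((n / total) ** 2 for n in counts)


# Hypothetical counts, not taken from the dataset.
even_split = diversity_index([50, 50])    # two equally sized groups -> 0.5
single_group = diversity_index([100, 0])  # everyone in one group -> 0.0
```

With $k$ equally sized groups the value is $1 - 1/k$, so $D$ approaches 1 as the population is spread evenly over more groups.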
import pandas as pd
df = pd.read_csv("data/diversityindex.csv")
head and tail of the data
df.head()
 | Location | State | County | Diversity-Index |
---|---|---|---|---|
0 | Aleutians West Census Area, AK | AK | Aleutians West Census Area | 0.769346 |
1 | Queens County, NY | NY | Queens County | 0.742224 |
2 | Maui County, HI | HI | Maui County | 0.740757 |
3 | Alameda County, CA | CA | Alameda County | 0.740399 |
4 | Aleutians East Borough, AK | AK | Aleutians East Borough | 0.738867 |
df.tail()
 | Location | State | County | Diversity-Index |
---|---|---|---|---|
3138 | Osage County, MO | MO | Osage County | 0.037540 |
3139 | Lincoln County, WV | WV | Lincoln County | 0.035585 |
3140 | Leslie County, KY | KY | Leslie County | 0.035581 |
3141 | Blaine County, NE | NE | Blaine County | 0.023784 |
3142 | Keya Paha County, NE | NE | Keya Paha County | 0.021816 |
loc to slice the DataFrame
df.loc[3, "Diversity-Index"]
0.740399
df.loc[len(df)//2 - 1 : len(df)//2 + 1, "Location"]
1570     Cleveland County, AR
1571    Lauderdale County, AL
1572      Hamilton County, IN
Name: Location, dtype: object
df.loc[[0, 10, 20, 50], "State":"Diversity-Index"]
 | State | County | Diversity-Index |
---|---|---|---|
0 | AK | Aleutians West Census Area | 0.769346 |
10 | NC | Robeson County | 0.704067 |
20 | CA | Contra Costa County | 0.686497 |
50 | CA | Sutter County | 0.647059 |
df.loc[1, :]
Location           Queens County, NY
State                             NY
County                 Queens County
Diversity-Index             0.742224
Name: 1, dtype: object
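The `loc` calls above select by label, and label slices include both endpoints. The following sketch shows this on a small toy frame; the frame `toy` and its values are hypothetical, and only the column names mirror the notebook's.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the notebook's.
toy = pd.DataFrame(
    {
        "State": ["AK", "NY", "HI", "CA"],
        "County": ["A", "B", "C", "D"],
        "Diversity-Index": [0.77, 0.74, 0.73, 0.72],
    }
)

# Label slices are inclusive on both ends, for rows and for columns.
rows = toy.loc[1:2, "State":"County"]    # rows 1 and 2, columns State and County
cell = toy.loc[3, "Diversity-Index"]     # one row label, one column label -> scalar
row = toy.loc[1, :]                      # one row label, all columns -> Series
```

Because the slice is by label, `toy.loc[1:2]` returns two rows here, whereas a positional slice `1:2` would return one.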
df.set_index("County", inplace=True)
df
County | Location | State | Diversity-Index |
---|---|---|---|
Aleutians West Census Area | Aleutians West Census Area, AK | AK | 0.769346 |
Queens County | Queens County, NY | NY | 0.742224 |
Maui County | Maui County, HI | HI | 0.740757 |
Alameda County | Alameda County, CA | CA | 0.740399 |
Aleutians East Borough | Aleutians East Borough, AK | AK | 0.738867 |
... | ... | ... | ... |
Osage County | Osage County, MO | MO | 0.037540 |
Lincoln County | Lincoln County, WV | WV | 0.035585 |
Leslie County | Leslie County, KY | KY | 0.035581 |
Blaine County | Blaine County, NE | NE | 0.023784 |
Keya Paha County | Keya Paha County, NE | NE | 0.021816 |
3143 rows × 3 columns
df.loc["Los Angeles County", :]
Location           Los Angeles County, CA
State                                  CA
Diversity-Index                  0.661865
Name: Los Angeles County, dtype: object
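Looking up a county name returns a single row only when that name is unique in the index. County names can repeat across states, so the sketch below shows what `loc` returns in each case. The frame `toy` is a hypothetical example, and `set_index` is called without `inplace=True` so the original frame stays unchanged.

```python
import pandas as pd

# Hypothetical rows for illustration; two counties share a name.
toy = pd.DataFrame(
    {
        "County": ["Lincoln County", "Queens County", "Lincoln County"],
        "State": ["WV", "NY", "NE"],
    }
)

# Without inplace=True, set_index returns a new frame and leaves toy as it was.
by_county = toy.set_index("County")

unique_hit = by_county.loc["Queens County", :]     # unique label -> Series
repeated_hit = by_county.loc["Lincoln County", :]  # repeated label -> DataFrame
```

If a unique lookup is needed, one option is to index on a column whose values do not repeat, such as `Location` in this dataset's layout.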
df.reset_index(inplace=True)
df
 | County | Location | State | Diversity-Index |
---|---|---|---|---|
0 | Aleutians West Census Area | Aleutians West Census Area, AK | AK | 0.769346 |
1 | Queens County | Queens County, NY | NY | 0.742224 |
2 | Maui County | Maui County, HI | HI | 0.740757 |
3 | Alameda County | Alameda County, CA | CA | 0.740399 |
4 | Aleutians East Borough | Aleutians East Borough, AK | AK | 0.738867 |
... | ... | ... | ... | ... |
3138 | Osage County | Osage County, MO | MO | 0.037540 |
3139 | Lincoln County | Lincoln County, WV | WV | 0.035585 |
3140 | Leslie County | Leslie County, KY | KY | 0.035581 |
3141 | Blaine County | Blaine County, NE | NE | 0.023784 |
3142 | Keya Paha County | Keya Paha County, NE | NE | 0.021816 |
3143 rows × 4 columns
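Note that `County` is now the first column rather than the third: `reset_index` inserts the former index at position 0. This matters for positional selection, where column numbers refer to the new order. A minimal sketch on a hypothetical one-row frame:

```python
import pandas as pd

# Hypothetical one-row frame with the notebook's original column order.
toy = pd.DataFrame(
    {
        "Location": ["Queens County, NY"],
        "State": ["NY"],
        "County": ["Queens County"],
    }
)

# Round trip: County becomes the index, then comes back as the first column.
round_trip = toy.set_index("County").reset_index()
new_order = list(round_trip.columns)  # ['County', 'Location', 'State']

# Selecting with a list of column names restores the original order.
restored = round_trip[list(toy.columns)]
```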
iloc to slice the DataFrame
df.iloc[1, 0:1]
County    Queens County
Name: 1, dtype: object
df.iloc[10:20, :]
 | County | Location | State | Diversity-Index |
---|---|---|---|---|
10 | Robeson County | Robeson County, NC | NC | 0.704067 |
11 | Gwinnett County | Gwinnett County, GA | GA | 0.702974 |
12 | Yakutat City and Borough | Yakutat City and Borough, AK | AK | 0.698748 |
13 | Santa Clara County | Santa Clara County, CA | CA | 0.694312 |
14 | Kings County | Kings County, NY | NY | 0.692349 |
15 | San Mateo County | San Mateo County, CA | CA | 0.691029 |
16 | Manassas Park city | Manassas Park city, VA | VA | 0.690899 |
17 | Dallas County | Dallas County, TX | TX | 0.690390 |
18 | Montgomery County | Montgomery County, MD | MD | 0.687803 |
19 | Sacramento County | Sacramento County, CA | CA | 0.687281 |
df.iloc[[13, 17], [0, 2, 3]]
 | County | State | Diversity-Index |
---|---|---|---|
13 | Santa Clara County | CA | 0.694312 |
17 | Dallas County | TX | 0.690390 |
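Unlike `loc`, `iloc` selects by integer position and excludes the end of a slice, as in ordinary Python slicing. The sketch below contrasts the two on a hypothetical toy frame, and also shows why `df.iloc[1, 0:1]` above returned a one-element Series rather than a plain value.

```python
import pandas as pd

# Hypothetical values; the default index is 0..4, so labels equal positions.
toy = pd.DataFrame(
    {
        "County": ["a", "b", "c", "d", "e"],
        "State": ["AK", "NY", "HI", "CA", "AK"],
    }
)

by_position = toy.iloc[1:3, :]  # positions 1 and 2 (end excluded)
by_label = toy.loc[1:3, :]      # labels 1, 2 and 3 (end included)

one_col_slice = toy.iloc[1, 0:1]  # column slice -> Series of length 1
cell = toy.iloc[1, 0]             # integer column -> scalar value
```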
[] to slice the DataFrame
df[-5:]
 | County | Location | State | Diversity-Index |
---|---|---|---|---|
3138 | Osage County | Osage County, MO | MO | 0.037540 |
3139 | Lincoln County | Lincoln County, WV | WV | 0.035585 |
3140 | Leslie County | Leslie County, KY | KY | 0.035581 |
3141 | Blaine County | Blaine County, NE | NE | 0.023784 |
3142 | Keya Paha County | Keya Paha County, NE | NE | 0.021816 |
df["Diversity-Index"]
0       0.769346
1       0.742224
2       0.740757
3       0.740399
4       0.738867
          ...
3138    0.037540
3139    0.035585
3140    0.035581
3141    0.023784
3142    0.021816
Name: Diversity-Index, Length: 3143, dtype: float64
df[["County", "State", "Location"]]
 | County | State | Location |
---|---|---|---|
0 | Aleutians West Census Area | AK | Aleutians West Census Area, AK |
1 | Queens County | NY | Queens County, NY |
2 | Maui County | HI | Maui County, HI |
3 | Alameda County | CA | Alameda County, CA |
4 | Aleutians East Borough | AK | Aleutians East Borough, AK |
... | ... | ... | ... |
3138 | Osage County | MO | Osage County, MO |
3139 | Lincoln County | WV | Lincoln County, WV |
3140 | Leslie County | KY | Leslie County, KY |
3141 | Blaine County | NE | Blaine County, NE |
3142 | Keya Paha County | NE | Keya Paha County, NE |
3143 rows × 3 columns
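The three `[]` calls above behave differently depending on what goes inside the brackets: a slice selects rows, a single column name returns a Series, and a list of column names returns a DataFrame. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical values; column names mirror the notebook's.
toy = pd.DataFrame(
    {
        "County": ["a", "b", "c"],
        "State": ["AK", "NY", "HI"],
        "Diversity-Index": [0.77, 0.74, 0.73],
    }
)

last_two = toy[-2:]                   # slice -> rows (here the last two)
one_column = toy["Diversity-Index"]   # single name -> Series
two_columns = toy[["County", "State"]]  # list of names -> DataFrame
one_as_frame = toy[["State"]]         # one-element list still gives a DataFrame
```

Since `[]` switches between rows and columns based on its argument, `loc` and `iloc` are often the clearer choice when both rows and columns need to be specified.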