Adapted from Lisa Yan
Updates by Dominic Liu
A demo of data cleaning and exploratory data analysis using the CDC Tuberculosis data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
pd.set_option('display.float_format', '{:.2f}'.format)
What can we say about the presence of Tuberculosis in the United States?
Let's look at the data included in the original CDC article published in 2021.
Suppose Table 1 was saved as a CSV file located in data/cdc_tuberculosis.csv
.
We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
pd.read_csv()
1, 2. Let's start with the first two so we really solidify the idea of a CSV as rectangular data (i.e., tabular data) stored as comma-separated values.
tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
tb_df
Unnamed: 0 | No. of TB cases | Unnamed: 2 | Unnamed: 3 | TB incidence | Unnamed: 5 | Unnamed: 6 | |
---|---|---|---|---|---|---|---|
0 | U.S. jurisdiction | 2019 | 2020 | 2021 | 2019.00 | 2020.00 | 2021.00 |
1 | Total | 8,900 | 7,173 | 7,860 | 2.71 | 2.16 | 2.37 |
2 | Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 |
3 | Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 |
4 | Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 |
... | ... | ... | ... | ... | ... | ... | ... |
48 | Virginia | 191 | 169 | 161 | 2.23 | 1.96 | 1.86 |
49 | Washington | 221 | 163 | 199 | 2.90 | 2.11 | 2.57 |
50 | West Virginia | 9 | 13 | 7 | 0.50 | 0.73 | 0.39 |
51 | Wisconsin | 51 | 35 | 66 | 0.88 | 0.59 | 1.12 |
52 | Wyoming | 1 | 0 | 3 | 0.17 | 0.00 | 0.52 |
53 rows × 7 columns
Wait, what's up with the "Unnamed" column names? And the first row, for that matter?
Congratulations -- you're ready to wrangle your data. Because of how things are stored, we'll need to clean the data a bit to name our columns better.
A reasonable first step is to identify the row with the right header. The pd.read_csv()
function (documentation) has the convenient header
parameter:
tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
tb_df
U.S. jurisdiction | 2019 | 2020 | 2021 | 2019.1 | 2020.1 | 2021.1 | |
---|---|---|---|---|---|---|---|
0 | Total | 8,900 | 7,173 | 7,860 | 2.71 | 2.16 | 2.37 |
1 | Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 |
2 | Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 |
3 | Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 |
4 | Arkansas | 64 | 59 | 69 | 2.12 | 1.96 | 2.28 |
... | ... | ... | ... | ... | ... | ... | ... |
47 | Virginia | 191 | 169 | 161 | 2.23 | 1.96 | 1.86 |
48 | Washington | 221 | 163 | 199 | 2.90 | 2.11 | 2.57 |
49 | West Virginia | 9 | 13 | 7 | 0.50 | 0.73 | 0.39 |
50 | Wisconsin | 51 | 35 | 66 | 0.88 | 0.59 | 1.12 |
51 | Wyoming | 1 | 0 | 3 | 0.17 | 0.00 | 0.52 |
52 rows × 7 columns
Wait...but now we can't differentiate betwen the "Number of TB cases" and "TB incidence" year columns. pandas has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us as humans understand the data.
We can do this manually with df.rename()
(documentation):
rename_dict = {'2019': 'TB cases 2019',
'2020': 'TB cases 2020',
'2021': 'TB cases 2021',
'2019.1': 'TB incidence 2019',
'2020.1': 'TB incidence 2020',
'2021.1': 'TB incidence 2021'}
tb_df = tb_df.rename(columns=rename_dict)
tb_df
U.S. jurisdiction | TB cases 2019 | TB cases 2020 | TB cases 2021 | TB incidence 2019 | TB incidence 2020 | TB incidence 2021 | |
---|---|---|---|---|---|---|---|
0 | Total | 8,900 | 7,173 | 7,860 | 2.71 | 2.16 | 2.37 |
1 | Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 |
2 | Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 |
3 | Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 |
4 | Arkansas | 64 | 59 | 69 | 2.12 | 1.96 | 2.28 |
... | ... | ... | ... | ... | ... | ... | ... |
47 | Virginia | 191 | 169 | 161 | 2.23 | 1.96 | 1.86 |
48 | Washington | 221 | 163 | 199 | 2.90 | 2.11 | 2.57 |
49 | West Virginia | 9 | 13 | 7 | 0.50 | 0.73 | 0.39 |
50 | Wisconsin | 51 | 35 | 66 | 0.88 | 0.59 | 1.12 |
51 | Wyoming | 1 | 0 | 3 | 0.17 | 0.00 | 0.52 |
52 rows × 7 columns
You might already be wondering: What's up with that first record?
Row 0 is what we call a rollup record, or summary record. It's often useful when displaying tables to humans. The granularity of record 0 (Totals) vs the rest of the records (States) is different.
Okay, EDA step two. How was the rollup record aggregated?
Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get 2x the total cases in each of our TB cases by year (why?).
tb_df.sum(axis=0)
U.S. jurisdiction TotalAlabamaAlaskaArizonaArkansasCaliforniaCol... TB cases 2019 8,9008758183642,111666718245583029973261085237... TB cases 2020 7,1737258136591,706525417194122219282169239376... TB cases 2021 7,8609258129691,750585443194992281064255127494... TB incidence 2019 109.94 TB incidence 2020 93.09 TB incidence 2021 102.94 dtype: object
Whoa, what's going on? Check out the column types:
tb_df.dtypes
U.S. jurisdiction object TB cases 2019 object TB cases 2020 object TB cases 2021 object TB incidence 2019 float64 TB incidence 2020 float64 TB incidence 2021 float64 dtype: object
Looks like those commas are causing all TB cases to be read as the object
datatype, or storage type (close to the Python string datatype), so pandas is concatenating strings instead of adding integers.
Fortunately read_csv
also has a thousands
parameter (for what it's worth, I didn't know this beforehand--I googled this):
# improve readability: chaining method calls with outer parentheses/line breaks
tb_df = (
pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
.rename(columns=rename_dict)
)
tb_df
U.S. jurisdiction | TB cases 2019 | TB cases 2020 | TB cases 2021 | TB incidence 2019 | TB incidence 2020 | TB incidence 2021 | |
---|---|---|---|---|---|---|---|
0 | Total | 8900 | 7173 | 7860 | 2.71 | 2.16 | 2.37 |
1 | Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 |
2 | Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 |
3 | Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 |
4 | Arkansas | 64 | 59 | 69 | 2.12 | 1.96 | 2.28 |
... | ... | ... | ... | ... | ... | ... | ... |
47 | Virginia | 191 | 169 | 161 | 2.23 | 1.96 | 1.86 |
48 | Washington | 221 | 163 | 199 | 2.90 | 2.11 | 2.57 |
49 | West Virginia | 9 | 13 | 7 | 0.50 | 0.73 | 0.39 |
50 | Wisconsin | 51 | 35 | 66 | 0.88 | 0.59 | 1.12 |
51 | Wyoming | 1 | 0 | 3 | 0.17 | 0.00 | 0.52 |
52 rows × 7 columns
tb_df.sum()
U.S. jurisdiction TotalAlabamaAlaskaArizonaArkansasCaliforniaCol... TB cases 2019 17800 TB cases 2020 14346 TB cases 2021 15720 TB incidence 2019 109.94 TB incidence 2020 93.09 TB incidence 2021 102.94 dtype: object
The Total TB cases look right. Phew!
(We'll leave it to your own EDA to figure out how the TB incidence "Totals" were aggregated...you may want to check out the bonus section first, though.)
Let's just look at the records with state-level granularity:
state_tb_df = tb_df[1:]
state_tb_df
U.S. jurisdiction | TB cases 2019 | TB cases 2020 | TB cases 2021 | TB incidence 2019 | TB incidence 2020 | TB incidence 2021 | |
---|---|---|---|---|---|---|---|
1 | Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 |
2 | Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 |
3 | Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 |
4 | Arkansas | 64 | 59 | 69 | 2.12 | 1.96 | 2.28 |
5 | California | 2111 | 1706 | 1750 | 5.35 | 4.32 | 4.46 |
... | ... | ... | ... | ... | ... | ... | ... |
47 | Virginia | 191 | 169 | 161 | 2.23 | 1.96 | 1.86 |
48 | Washington | 221 | 163 | 199 | 2.90 | 2.11 | 2.57 |
49 | West Virginia | 9 | 13 | 7 | 0.50 | 0.73 | 0.39 |
50 | Wisconsin | 51 | 35 | 66 | 0.88 | 0.59 | 1.12 |
51 | Wyoming | 1 | 0 | 3 | 0.17 | 0.00 | 0.52 |
51 rows × 7 columns
What do each of these values represent? Why?
To the lecture!
U.S. Census population estimates source (2019), source (2020-2021).
Running the below cells cleans the data. We encourage you to closely explore the CSV and study these lines after lecture...
There are a few new methods here:
df.convert_dtypes()
(documentation) conveniently converts all float dtypes into ints and is out of scope for the class.df.drop_na()
(documentation) will be explained in more detail next time.# 2010s census data
census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
census_2010s_df = (
census_2010s_df
.reset_index()
.drop(columns=["index", "Census", "Estimates Base"])
.rename(columns={"Unnamed: 0": "Geographic Area"})
.convert_dtypes() # "smart" converting of columns, use at your own risk
.dropna() # we'll introduce this next time
)
census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
# with pd.option_context('display.min_rows', 30): # shows more rows
# display(census_2010s_df)
census_2010s_df
Geographic Area | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | United States | 309321666 | 311556874 | 313830990 | 315993715 | 318301008 | 320635163 | 322941311 | 324985539 | 326687501 | 328239523 |
1 | Northeast | 55380134 | 55604223 | 55775216 | 55901806 | 56006011 | 56034684 | 56042330 | 56059240 | 56046620 | 55982803 |
2 | Midwest | 66974416 | 67157800 | 67336743 | 67560379 | 67745167 | 67860583 | 67987540 | 68126781 | 68236628 | 68329004 |
3 | South | 114866680 | 116006522 | 117241208 | 118364400 | 119624037 | 120997341 | 122351760 | 123542189 | 124569433 | 125580448 |
4 | West | 72100436 | 72788329 | 73477823 | 74167130 | 74925793 | 75742555 | 76559681 | 77257329 | 77834820 | 78347268 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
52 | Washington | 6742830 | 6826627 | 6897058 | 6963985 | 7054655 | 7163657 | 7294771 | 7423362 | 7523869 | 7614893 |
53 | West Virginia | 1854239 | 1856301 | 1856872 | 1853914 | 1849489 | 1842050 | 1831023 | 1817004 | 1804291 | 1792147 |
54 | Wisconsin | 5690475 | 5705288 | 5719960 | 5736754 | 5751525 | 5760940 | 5772628 | 5790186 | 5807406 | 5822434 |
55 | Wyoming | 564487 | 567299 | 576305 | 582122 | 582531 | 585613 | 584215 | 578931 | 577601 | 578759 |
57 | Puerto Rico | 3721525 | 3678732 | 3634488 | 3593077 | 3534874 | 3473232 | 3406672 | 3325286 | 3193354 | 3193694 |
57 rows × 11 columns
Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use the python importlib library:
from importlib import reload
reload(utils)
or use iPython magic which will intelligently import code when files change:
%load_ext autoreload
%autoreload 2
# census 2020s data
census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
census_2020s_df = (
census_2020s_df
.reset_index()
.drop(columns=["index", "Unnamed: 1"])
.rename(columns={"Unnamed: 0": "Geographic Area"})
.convert_dtypes() # "smart" converting of columns, use at your own risk
.dropna() # we'll introduce this next time
)
census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
census_2020s_df
Geographic Area | 2020 | 2021 | 2022 | |
---|---|---|---|---|
0 | United States | 331511512 | 332031554 | 333287557 |
1 | Northeast | 57448898 | 57259257 | 57040406 |
2 | Midwest | 68961043 | 68836505 | 68787595 |
3 | South | 126450613 | 127346029 | 128716192 |
4 | West | 78650958 | 78589763 | 78743364 |
... | ... | ... | ... | ... |
52 | Washington | 7724031 | 7740745 | 7785786 |
53 | West Virginia | 1791420 | 1785526 | 1775156 |
54 | Wisconsin | 5896271 | 5880101 | 5892539 |
55 | Wyoming | 577605 | 579483 | 581381 |
57 | Puerto Rico | 3281557 | 3262693 | 3221789 |
57 rows × 4 columns
Time to merge
! Here I use the DataFrame method df1.merge(right=df2, ...)
on DataFrame df1
(documentation). Contrast this with the function pd.merge(left=df1, right=df2, ...)
(documentation). Feel free to use either.
# merge TB dataframe with two US census dataframes
tb_census_df = (
tb_df
.merge(right=census_2010s_df,
left_on="U.S. jurisdiction", right_on="Geographic Area")
.merge(right=census_2020s_df,
left_on="U.S. jurisdiction", right_on="Geographic Area")
)
tb_census_df
U.S. jurisdiction | TB cases 2019 | TB cases 2020 | TB cases 2021 | TB incidence 2019 | TB incidence 2020 | TB incidence 2021 | Geographic Area_x | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | Geographic Area_y | 2020 | 2021 | 2022 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 | Alabama | 4785437 | 4799069 | 4815588 | 4830081 | 4841799 | 4852347 | 4863525 | 4874486 | 4887681 | 4903185 | Alabama | 5031362 | 5049846 | 5074296 |
1 | Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 | Alaska | 713910 | 722128 | 730443 | 737068 | 736283 | 737498 | 741456 | 739700 | 735139 | 731545 | Alaska | 732923 | 734182 | 733583 |
2 | Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 | Arizona | 6407172 | 6472643 | 6554978 | 6632764 | 6730413 | 6829676 | 6941072 | 7044008 | 7158024 | 7278717 | Arizona | 7179943 | 7264877 | 7359197 |
3 | Arkansas | 64 | 59 | 69 | 2.12 | 1.96 | 2.28 | Arkansas | 2921964 | 2940667 | 2952164 | 2959400 | 2967392 | 2978048 | 2989918 | 3001345 | 3009733 | 3017804 | Arkansas | 3014195 | 3028122 | 3045637 |
4 | California | 2111 | 1706 | 1750 | 5.35 | 4.32 | 4.46 | California | 37319502 | 37638369 | 37948800 | 38260787 | 38596972 | 38918045 | 39167117 | 39358497 | 39461588 | 39512223 | California | 39501653 | 39142991 | 39029342 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
46 | Virginia | 191 | 169 | 161 | 2.23 | 1.96 | 1.86 | Virginia | 8023699 | 8101155 | 8185080 | 8252427 | 8310993 | 8361808 | 8410106 | 8463587 | 8501286 | 8535519 | Virginia | 8636471 | 8657365 | 8683619 |
47 | Washington | 221 | 163 | 199 | 2.90 | 2.11 | 2.57 | Washington | 6742830 | 6826627 | 6897058 | 6963985 | 7054655 | 7163657 | 7294771 | 7423362 | 7523869 | 7614893 | Washington | 7724031 | 7740745 | 7785786 |
48 | West Virginia | 9 | 13 | 7 | 0.50 | 0.73 | 0.39 | West Virginia | 1854239 | 1856301 | 1856872 | 1853914 | 1849489 | 1842050 | 1831023 | 1817004 | 1804291 | 1792147 | West Virginia | 1791420 | 1785526 | 1775156 |
49 | Wisconsin | 51 | 35 | 66 | 0.88 | 0.59 | 1.12 | Wisconsin | 5690475 | 5705288 | 5719960 | 5736754 | 5751525 | 5760940 | 5772628 | 5790186 | 5807406 | 5822434 | Wisconsin | 5896271 | 5880101 | 5892539 |
50 | Wyoming | 1 | 0 | 3 | 0.17 | 0.00 | 0.52 | Wyoming | 564487 | 567299 | 576305 | 582122 | 582531 | 585613 | 584215 | 578931 | 577601 | 578759 | Wyoming | 577605 | 579483 | 581381 |
51 rows × 22 columns
This is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census DataFrames. Let's do the latter.
# try merging again, but cleaner this time
tb_census_df = (
tb_df
.merge(right=census_2010s_df[["Geographic Area", "2019"]],
left_on="U.S. jurisdiction", right_on="Geographic Area")
.drop(columns="Geographic Area")
.merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
left_on="U.S. jurisdiction", right_on="Geographic Area")
.drop(columns="Geographic Area")
)
tb_census_df
U.S. jurisdiction | TB cases 2019 | TB cases 2020 | TB cases 2021 | TB incidence 2019 | TB incidence 2020 | TB incidence 2021 | 2019 | 2020 | 2021 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 | 4903185 | 5031362 | 5049846 |
1 | Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 | 731545 | 732923 | 734182 |
2 | Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 | 7278717 | 7179943 | 7264877 |
3 | Arkansas | 64 | 59 | 69 | 2.12 | 1.96 | 2.28 | 3017804 | 3014195 | 3028122 |
4 | California | 2111 | 1706 | 1750 | 5.35 | 4.32 | 4.46 | 39512223 | 39501653 | 39142991 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
46 | Virginia | 191 | 169 | 161 | 2.23 | 1.96 | 1.86 | 8535519 | 8636471 | 8657365 |
47 | Washington | 221 | 163 | 199 | 2.90 | 2.11 | 2.57 | 7614893 | 7724031 | 7740745 |
48 | West Virginia | 9 | 13 | 7 | 0.50 | 0.73 | 0.39 | 1792147 | 1791420 | 1785526 |
49 | Wisconsin | 51 | 35 | 66 | 0.88 | 0.59 | 1.12 | 5822434 | 5896271 | 5880101 |
50 | Wyoming | 1 | 0 | 3 | 0.17 | 0.00 | 0.52 | 578759 | 577605 | 579483 |
51 rows × 10 columns
Let's recompute incidence to make sure we know where the original CDC numbers came from.
From the CDC report: TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
$$\text{TB incidence} = \frac{\text{# TB cases in population}}{\text{# groups in population}} = \frac{\text{# TB cases in population}}{\text{population}/100000} $$$$= \frac{\text{# TB cases in population}}{\text{population}} \times 100000$$Let's try this for 2019:
tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
tb_census_df
U.S. jurisdiction | TB cases 2019 | TB cases 2020 | TB cases 2021 | TB incidence 2019 | TB incidence 2020 | TB incidence 2021 | 2019 | 2020 | 2021 | recompute incidence 2019 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 | 4903185 | 5031362 | 5049846 | 1.77 |
1 | Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 | 731545 | 732923 | 734182 | 7.93 |
2 | Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 | 7278717 | 7179943 | 7264877 | 2.51 |
3 | Arkansas | 64 | 59 | 69 | 2.12 | 1.96 | 2.28 | 3017804 | 3014195 | 3028122 | 2.12 |
4 | California | 2111 | 1706 | 1750 | 5.35 | 4.32 | 4.46 | 39512223 | 39501653 | 39142991 | 5.34 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
46 | Virginia | 191 | 169 | 161 | 2.23 | 1.96 | 1.86 | 8535519 | 8636471 | 8657365 | 2.24 |
47 | Washington | 221 | 163 | 199 | 2.90 | 2.11 | 2.57 | 7614893 | 7724031 | 7740745 | 2.90 |
48 | West Virginia | 9 | 13 | 7 | 0.50 | 0.73 | 0.39 | 1792147 | 1791420 | 1785526 | 0.50 |
49 | Wisconsin | 51 | 35 | 66 | 0.88 | 0.59 | 1.12 | 5822434 | 5896271 | 5880101 | 0.88 |
50 | Wyoming | 1 | 0 | 3 | 0.17 | 0.00 | 0.52 | 578759 | 577605 | 579483 | 0.17 |
51 rows × 11 columns
Awesome!!!
Let's use a for-loop and Python format strings to compute TB incidence for all years. Python f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course (Python documentation).
# recompute incidence for all years
for year in [2019, 2020, 2021]:
tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
tb_census_df
U.S. jurisdiction | TB cases 2019 | TB cases 2020 | TB cases 2021 | TB incidence 2019 | TB incidence 2020 | TB incidence 2021 | 2019 | 2020 | 2021 | recompute incidence 2019 | recompute incidence 2020 | recompute incidence 2021 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 | 4903185 | 5031362 | 5049846 | 1.77 | 1.43 | 1.82 |
1 | Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 | 731545 | 732923 | 734182 | 7.93 | 7.91 | 7.90 |
2 | Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 | 7278717 | 7179943 | 7264877 | 2.51 | 1.89 | 1.78 |
3 | Arkansas | 64 | 59 | 69 | 2.12 | 1.96 | 2.28 | 3017804 | 3014195 | 3028122 | 2.12 | 1.96 | 2.28 |
4 | California | 2111 | 1706 | 1750 | 5.35 | 4.32 | 4.46 | 39512223 | 39501653 | 39142991 | 5.34 | 4.32 | 4.47 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
46 | Virginia | 191 | 169 | 161 | 2.23 | 1.96 | 1.86 | 8535519 | 8636471 | 8657365 | 2.24 | 1.96 | 1.86 |
47 | Washington | 221 | 163 | 199 | 2.90 | 2.11 | 2.57 | 7614893 | 7724031 | 7740745 | 2.90 | 2.11 | 2.57 |
48 | West Virginia | 9 | 13 | 7 | 0.50 | 0.73 | 0.39 | 1792147 | 1791420 | 1785526 | 0.50 | 0.73 | 0.39 |
49 | Wisconsin | 51 | 35 | 66 | 0.88 | 0.59 | 1.12 | 5822434 | 5896271 | 5880101 | 0.88 | 0.59 | 1.12 |
50 | Wyoming | 1 | 0 | 3 | 0.17 | 0.00 | 0.52 | 578759 | 577605 | 579483 | 0.17 | 0.00 | 0.52 |
51 rows × 13 columns
These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy. We'll leave it to you!
tb_census_df.describe()
TB cases 2019 | TB cases 2020 | TB cases 2021 | TB incidence 2019 | TB incidence 2020 | TB incidence 2021 | 2019 | 2020 | 2021 | recompute incidence 2019 | recompute incidence 2020 | recompute incidence 2021 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 |
mean | 174.51 | 140.65 | 154.12 | 2.10 | 1.78 | 1.97 | 6436069.08 | 6500225.73 | 6510422.63 | 2.10 | 1.78 | 1.97 |
std | 341.74 | 271.06 | 286.78 | 1.50 | 1.34 | 1.48 | 7360660.47 | 7408168.46 | 7394300.08 | 1.50 | 1.34 | 1.47 |
min | 1.00 | 0.00 | 2.00 | 0.17 | 0.00 | 0.21 | 578759.00 | 577605.00 | 579483.00 | 0.17 | 0.00 | 0.21 |
25% | 25.50 | 29.00 | 23.00 | 1.29 | 1.21 | 1.23 | 1789606.00 | 1820311.00 | 1844920.00 | 1.30 | 1.21 | 1.23 |
50% | 70.00 | 67.00 | 69.00 | 1.80 | 1.52 | 1.70 | 4467673.00 | 4507445.00 | 4506589.00 | 1.81 | 1.52 | 1.69 |
75% | 180.50 | 139.00 | 150.00 | 2.58 | 1.99 | 2.22 | 7446805.00 | 7451987.00 | 7502811.00 | 2.58 | 1.99 | 2.22 |
max | 2111.00 | 1706.00 | 1750.00 | 7.91 | 7.92 | 7.92 | 39512223.00 | 39501653.00 | 39142991.00 | 7.93 | 7.91 | 7.90 |
We likely won't get to this part, so a tutorial is provided for your own curiosity.
How do we reproduce that reported statistic in the original CDC report?
Reported TB incidence (cases per 100,000 persons) increased 9.4%, from 2.2 during 2020 to 2.4 during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
This is TB incidence computed across the entire U.S. population! How do we reproduce this
tb_census_df
only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.Let's get exploring!
Before we keep exploring, I'm going to set all indexes to more meaningful values, instead of just numbers that pertained to some row at some point. This will make our cleaning slightly easier.
tb_df = tb_df.set_index("U.S. jurisdiction")
tb_df
TB cases 2019 | TB cases 2020 | TB cases 2021 | TB incidence 2019 | TB incidence 2020 | TB incidence 2021 | |
---|---|---|---|---|---|---|
U.S. jurisdiction | ||||||
Total | 8900 | 7173 | 7860 | 2.71 | 2.16 | 2.37 |
Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 |
Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 |
Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 |
Arkansas | 64 | 59 | 69 | 2.12 | 1.96 | 2.28 |
... | ... | ... | ... | ... | ... | ... |
Virginia | 191 | 169 | 161 | 2.23 | 1.96 | 1.86 |
Washington | 221 | 163 | 199 | 2.90 | 2.11 | 2.57 |
West Virginia | 9 | 13 | 7 | 0.50 | 0.73 | 0.39 |
Wisconsin | 51 | 35 | 66 | 0.88 | 0.59 | 1.12 |
Wyoming | 1 | 0 | 3 | 0.17 | 0.00 | 0.52 |
52 rows × 6 columns
census_2010s_df = census_2010s_df.set_index("Geographic Area")
census_2010s_df
2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|
Geographic Area | ||||||||||
United States | 309321666 | 311556874 | 313830990 | 315993715 | 318301008 | 320635163 | 322941311 | 324985539 | 326687501 | 328239523 |
Northeast | 55380134 | 55604223 | 55775216 | 55901806 | 56006011 | 56034684 | 56042330 | 56059240 | 56046620 | 55982803 |
Midwest | 66974416 | 67157800 | 67336743 | 67560379 | 67745167 | 67860583 | 67987540 | 68126781 | 68236628 | 68329004 |
South | 114866680 | 116006522 | 117241208 | 118364400 | 119624037 | 120997341 | 122351760 | 123542189 | 124569433 | 125580448 |
West | 72100436 | 72788329 | 73477823 | 74167130 | 74925793 | 75742555 | 76559681 | 77257329 | 77834820 | 78347268 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Washington | 6742830 | 6826627 | 6897058 | 6963985 | 7054655 | 7163657 | 7294771 | 7423362 | 7523869 | 7614893 |
West Virginia | 1854239 | 1856301 | 1856872 | 1853914 | 1849489 | 1842050 | 1831023 | 1817004 | 1804291 | 1792147 |
Wisconsin | 5690475 | 5705288 | 5719960 | 5736754 | 5751525 | 5760940 | 5772628 | 5790186 | 5807406 | 5822434 |
Wyoming | 564487 | 567299 | 576305 | 582122 | 582531 | 585613 | 584215 | 578931 | 577601 | 578759 |
Puerto Rico | 3721525 | 3678732 | 3634488 | 3593077 | 3534874 | 3473232 | 3406672 | 3325286 | 3193354 | 3193694 |
57 rows × 10 columns
census_2020s_df = census_2020s_df.set_index("Geographic Area")
census_2020s_df
2020 | 2021 | 2022 | |
---|---|---|---|
Geographic Area | |||
United States | 331511512 | 332031554 | 333287557 |
Northeast | 57448898 | 57259257 | 57040406 |
Midwest | 68961043 | 68836505 | 68787595 |
South | 126450613 | 127346029 | 128716192 |
West | 78650958 | 78589763 | 78743364 |
... | ... | ... | ... |
Washington | 7724031 | 7740745 | 7785786 |
West Virginia | 1791420 | 1785526 | 1775156 |
Wisconsin | 5896271 | 5880101 | 5892539 |
Wyoming | 577605 | 579483 | 581381 |
Puerto Rico | 3281557 | 3262693 | 3221789 |
57 rows × 3 columns
It turns out that our merge above only kept state records, even though our original tb_df
had the "Total" rolled record:
tb_df.head()
TB cases 2019 | TB cases 2020 | TB cases 2021 | TB incidence 2019 | TB incidence 2020 | TB incidence 2021 | |
---|---|---|---|---|---|---|
U.S. jurisdiction | ||||||
Total | 8900 | 7173 | 7860 | 2.71 | 2.16 | 2.37 |
Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 |
Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 |
Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 |
Arkansas | 64 | 59 | 69 | 2.12 | 1.96 | 2.28 |
Recall that merge by default does an inner merge by default, meaning that it only preserves keys that are present in both DataFrames.
The rolled records in our census dataframes have different Geographic Area
fields, which was the key we merged on:
census_2010s_df
2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|
Geographic Area | ||||||||||
United States | 309321666 | 311556874 | 313830990 | 315993715 | 318301008 | 320635163 | 322941311 | 324985539 | 326687501 | 328239523 |
Northeast | 55380134 | 55604223 | 55775216 | 55901806 | 56006011 | 56034684 | 56042330 | 56059240 | 56046620 | 55982803 |
Midwest | 66974416 | 67157800 | 67336743 | 67560379 | 67745167 | 67860583 | 67987540 | 68126781 | 68236628 | 68329004 |
South | 114866680 | 116006522 | 117241208 | 118364400 | 119624037 | 120997341 | 122351760 | 123542189 | 124569433 | 125580448 |
West | 72100436 | 72788329 | 73477823 | 74167130 | 74925793 | 75742555 | 76559681 | 77257329 | 77834820 | 78347268 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Washington | 6742830 | 6826627 | 6897058 | 6963985 | 7054655 | 7163657 | 7294771 | 7423362 | 7523869 | 7614893 |
West Virginia | 1854239 | 1856301 | 1856872 | 1853914 | 1849489 | 1842050 | 1831023 | 1817004 | 1804291 | 1792147 |
Wisconsin | 5690475 | 5705288 | 5719960 | 5736754 | 5751525 | 5760940 | 5772628 | 5790186 | 5807406 | 5822434 |
Wyoming | 564487 | 567299 | 576305 | 582122 | 582531 | 585613 | 584215 | 578931 | 577601 | 578759 |
Puerto Rico | 3721525 | 3678732 | 3634488 | 3593077 | 3534874 | 3473232 | 3406672 | 3325286 | 3193354 | 3193694 |
57 rows × 10 columns
The Census DataFrame has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use df.rename()
(documentation):
# rename rolled record for 2010s
census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
census_2010s_df
2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|
Geographic Area | ||||||||||
Total | 309321666 | 311556874 | 313830990 | 315993715 | 318301008 | 320635163 | 322941311 | 324985539 | 326687501 | 328239523 |
Northeast | 55380134 | 55604223 | 55775216 | 55901806 | 56006011 | 56034684 | 56042330 | 56059240 | 56046620 | 55982803 |
Midwest | 66974416 | 67157800 | 67336743 | 67560379 | 67745167 | 67860583 | 67987540 | 68126781 | 68236628 | 68329004 |
South | 114866680 | 116006522 | 117241208 | 118364400 | 119624037 | 120997341 | 122351760 | 123542189 | 124569433 | 125580448 |
West | 72100436 | 72788329 | 73477823 | 74167130 | 74925793 | 75742555 | 76559681 | 77257329 | 77834820 | 78347268 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Washington | 6742830 | 6826627 | 6897058 | 6963985 | 7054655 | 7163657 | 7294771 | 7423362 | 7523869 | 7614893 |
West Virginia | 1854239 | 1856301 | 1856872 | 1853914 | 1849489 | 1842050 | 1831023 | 1817004 | 1804291 | 1792147 |
Wisconsin | 5690475 | 5705288 | 5719960 | 5736754 | 5751525 | 5760940 | 5772628 | 5790186 | 5807406 | 5822434 |
Wyoming | 564487 | 567299 | 576305 | 582122 | 582531 | 585613 | 584215 | 578931 | 577601 | 578759 |
Puerto Rico | 3721525 | 3678732 | 3634488 | 3593077 | 3534874 | 3473232 | 3406672 | 3325286 | 3193354 | 3193694 |
57 rows × 10 columns
# same, but for 2020s rename rolled record
census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
census_2020s_df
2020 | 2021 | 2022 | |
---|---|---|---|
Geographic Area | |||
Total | 331511512 | 332031554 | 333287557 |
Northeast | 57448898 | 57259257 | 57040406 |
Midwest | 68961043 | 68836505 | 68787595 |
South | 126450613 | 127346029 | 128716192 |
West | 78650958 | 78589763 | 78743364 |
... | ... | ... | ... |
Washington | 7724031 | 7740745 | 7785786 |
West Virginia | 1791420 | 1785526 | 1775156 |
Wisconsin | 5896271 | 5880101 | 5892539 |
Wyoming | 577605 | 579483 | 581381 |
Puerto Rico | 3281557 | 3262693 | 3221789 |
57 rows × 3 columns
Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (df.merge()
documentation).
tb_census_df = (
tb_df
.merge(right=census_2010s_df[["2019"]],
left_index=True, right_index=True)
.merge(right=census_2020s_df[["2020", "2021"]],
left_index=True, right_index=True)
)
tb_census_df
TB cases 2019 | TB cases 2020 | TB cases 2021 | TB incidence 2019 | TB incidence 2020 | TB incidence 2021 | 2019 | 2020 | 2021 | |
---|---|---|---|---|---|---|---|---|---|
Total | 8900 | 7173 | 7860 | 2.71 | 2.16 | 2.37 | 328239523 | 331511512 | 332031554 |
Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 | 4903185 | 5031362 | 5049846 |
Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 | 731545 | 732923 | 734182 |
Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 | 7278717 | 7179943 | 7264877 |
Arkansas | 64 | 59 | 69 | 2.12 | 1.96 | 2.28 | 3017804 | 3014195 | 3028122 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Virginia | 191 | 169 | 161 | 2.23 | 1.96 | 1.86 | 8535519 | 8636471 | 8657365 |
Washington | 221 | 163 | 199 | 2.90 | 2.11 | 2.57 | 7614893 | 7724031 | 7740745 |
West Virginia | 9 | 13 | 7 | 0.50 | 0.73 | 0.39 | 1792147 | 1791420 | 1785526 |
Wisconsin | 51 | 35 | 66 | 0.88 | 0.59 | 1.12 | 5822434 | 5896271 | 5880101 |
Wyoming | 1 | 0 | 3 | 0.17 | 0.00 | 0.52 | 578759 | 577605 | 579483 |
52 rows × 9 columns
Finally, let's recompute our incidences:
# recompute incidence for all years
for year in [2019, 2020, 2021]:
tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
tb_census_df
TB cases 2019 | TB cases 2020 | TB cases 2021 | TB incidence 2019 | TB incidence 2020 | TB incidence 2021 | 2019 | 2020 | 2021 | recompute incidence 2019 | recompute incidence 2020 | recompute incidence 2021 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Total | 8900 | 7173 | 7860 | 2.71 | 2.16 | 2.37 | 328239523 | 331511512 | 332031554 | 2.71 | 2.16 | 2.37 |
Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 | 4903185 | 5031362 | 5049846 | 1.77 | 1.43 | 1.82 |
Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 | 731545 | 732923 | 734182 | 7.93 | 7.91 | 7.90 |
Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 | 7278717 | 7179943 | 7264877 | 2.51 | 1.89 | 1.78 |
Arkansas | 64 | 59 | 69 | 2.12 | 1.96 | 2.28 | 3017804 | 3014195 | 3028122 | 2.12 | 1.96 | 2.28 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Virginia | 191 | 169 | 161 | 2.23 | 1.96 | 1.86 | 8535519 | 8636471 | 8657365 | 2.24 | 1.96 | 1.86 |
Washington | 221 | 163 | 199 | 2.90 | 2.11 | 2.57 | 7614893 | 7724031 | 7740745 | 2.90 | 2.11 | 2.57 |
West Virginia | 9 | 13 | 7 | 0.50 | 0.73 | 0.39 | 1792147 | 1791420 | 1785526 | 0.50 | 0.73 | 0.39 |
Wisconsin | 51 | 35 | 66 | 0.88 | 0.59 | 1.12 | 5822434 | 5896271 | 5880101 | 0.88 | 0.59 | 1.12 |
Wyoming | 1 | 0 | 3 | 0.17 | 0.00 | 0.52 | 578759 | 577605 | 579483 | 0.17 | 0.00 | 0.52 |
52 rows × 12 columns
We reproduced the total U.S. incidences correctly!
We're almost there. Let's revisit the quote:
Reported TB incidence (cases per 100,000 persons) increased 9.4%, from 2.2 during 2020 to 2.4 during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
Recall that percent change from $A$ to $B$ is computed as $$\text{percent change} = \frac{B - A}{A} \times 100$$.
incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
incidence_2020
2.1637257652759883
incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
incidence_2021
2.3672448914298068
difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
difference
9.405957511804143
We did it!!!