👩⚕️ Lecture 5 (Part 1, Tuberculosis) – Data 100, Spring 2025
Data 100, Spring 2025
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
# Use 5 decimal places instead of scientific notation in pandas
pd.set_option('display.float_format', '{:.5f}'.format)
# Silence some spurious seaborn warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
🦠 Tuberculosis in the United States
What can we say about the presence of Tuberculosis in the United States?
Let's work with the data included in the original CDC article published in 2021.
We can explore the CSV file in many ways:
- Using the JupyterLab explorer tool (read-only!).
- Opening the CSV in DataHub, or Excel, or Google Sheets, etc.
- Inspecting the Python file object.
- With pandas, using pd.read_csv().
🧭 Methods 1 and 2: Play with the data in the JupyterLab Explorer and DataHub
To solidify the idea of a CSV as rectangular data (i.e., tabular data) stored as comma-separated values, let's start with the first two methods.
1. Use the file browser in JupyterLab to locate the CSV at data/cdc_tuberculosis.csv, and double-click on it.
2. Right-click on the CSV in the file browser. Select Open With --> Editor. But, don't make any changes to the file!
🐍 Method 3: Play with the data in Python
Next, we will load in the data as a Python file object and inspect a couple lines.
With the code below, we can check out the first four lines of the CSV:
# Open the TB case count CSV, and print the first four lines
with open("data/cdc_tuberculosis.csv", "r") as f:
    for i, row in enumerate(f):
        print(row)
        if i >= 3: break
,No. of TB cases,,

U.S. jurisdiction,2019,2020,2021

Total,"8,900","7,173","7,860"

Alabama,87,72,92
As expected, we have four lines of comma-separated values!
Why are there blank lines between each line of the CSV file?
You may recall that line breaks in text files are encoded with the special newline character \n. Python's print() function prints each line, interpreting the \n at the end of each line as a newline, and also adds an additional newline.
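To see this concretely, here is a tiny demonstration (not part of the original notebook) of why a blank line follows each printed row:
print("Alabama,87,72,92\n")          # the row's own \n plus print's newline -> a blank line follows
print("Alabama,87,72,92\n", end="")  # suppress print's extra newline -> no blank line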
We can use the repr() ("representation") function to return the raw string representation of each line (i.e., all special characters will be visible).
- In other words, print() will not interpret \n as a newline; it will literally print \n.
- Note, though, that print() adds a newline each time it is called. Otherwise, we would have one long string below instead of four lines.
# Open the TB case count CSV, and print the raw representation of
# the first four lines
with open("data/cdc_tuberculosis.csv", "r") as f:
    for i, row in enumerate(f):
        print(repr(row))  # print raw strings
        if i >= 3: break
',No. of TB cases,,\n'
'U.S. jurisdiction,2019,2020,2021\n'
'Total,"8,900","7,173","7,860"\n'
'Alabama,87,72,92\n'
🐼 Method 4: Play with the data using pandas
It's time for the tried-and-true Data 100 approach: pandas.
tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
tb_df
Unnamed: 0 | No. of TB cases | Unnamed: 2 | Unnamed: 3 | |
---|---|---|---|---|
0 | U.S. jurisdiction | 2019 | 2020 | 2021 |
1 | Total | 8,900 | 7,173 | 7,860 |
2 | Alabama | 87 | 72 | 92 |
3 | Alaska | 58 | 58 | 58 |
4 | Arizona | 183 | 136 | 129 |
... | ... | ... | ... | ... |
48 | Virginia | 191 | 169 | 161 |
49 | Washington | 221 | 163 | 199 |
50 | West Virginia | 9 | 13 | 7 |
51 | Wisconsin | 51 | 35 | 66 |
52 | Wyoming | 1 | 0 | 3 |
53 rows × 4 columns
What's going on with the "Unnamed" column names?
And why does the first row look different than the other rows?
We're ready to wrangle the data!
A reasonable first step is to identify the row with the right header (i.e., the row with the column names).
The pd.read_csv() function (documentation) has a convenient header parameter for specifying the index of the row you want to use as the header:
# header=1 tells pandas to ignore row 0, and use row 1 as the column names
tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1)
tb_df
U.S. jurisdiction | 2019 | 2020 | 2021 | |
---|---|---|---|---|
0 | Total | 8,900 | 7,173 | 7,860 |
1 | Alabama | 87 | 72 | 92 |
2 | Alaska | 58 | 58 | 58 |
3 | Arizona | 183 | 136 | 129 |
4 | Arkansas | 64 | 59 | 69 |
... | ... | ... | ... | ... |
47 | Virginia | 191 | 169 | 161 |
48 | Washington | 221 | 163 | 199 |
49 | West Virginia | 9 | 13 | 7 |
50 | Wisconsin | 51 | 35 | 66 |
51 | Wyoming | 1 | 0 | 3 |
52 rows × 4 columns
Notice that we no longer have "Unnamed" columns.
Instructor note: Return to slides!
Notice that the first record (i.e., row 0) differs from the other records:
tb_df.head()
U.S. jurisdiction | 2019 | 2020 | 2021 | |
---|---|---|---|---|
0 | Total | 8,900 | 7,173 | 7,860 |
1 | Alabama | 87 | 72 | 92 |
2 | Alaska | 58 | 58 | 58 |
3 | Arizona | 183 | 136 | 129 |
4 | Arkansas | 64 | 59 | 69 |
Row 0 is what we call a rollup record, or a summary record.
The granularity of record 0 (i.e., the total counts summed over all states) differs from the granularity of the rest of the records (i.e., the counts for individual states).
Rollup records are often useful when displaying tables to humans. But, rollup records are generally less useful when using the data for further analysis, since the rollup record "overlaps" with other records (i.e., info from other rows is aggregated to create the rollup record).
Okay, EDA step two. How was the rollup record aggregated?
Let's check if total TB cases (row 0) is indeed the sum of all state TB cases (all other rows).
- To do this, we can drop row 0, and sum up all the remaining rows.
tb_df.drop(0)
U.S. jurisdiction | 2019 | 2020 | 2021 | |
---|---|---|---|---|
1 | Alabama | 87 | 72 | 92 |
2 | Alaska | 58 | 58 | 58 |
3 | Arizona | 183 | 136 | 129 |
4 | Arkansas | 64 | 59 | 69 |
5 | California | 2,111 | 1,706 | 1,750 |
... | ... | ... | ... | ... |
47 | Virginia | 191 | 169 | 161 |
48 | Washington | 221 | 163 | 199 |
49 | West Virginia | 9 | 13 | 7 |
50 | Wisconsin | 51 | 35 | 66 |
51 | Wyoming | 1 | 0 | 3 |
51 rows × 4 columns
tb_df.drop(0).sum()
U.S. jurisdiction    AlabamaAlaskaArizonaArkansasCaliforniaColorado...
2019                 8758183642,11166671824558302997326108523766881...
2020                 7258136591,70652541719412221928216923937679917...
2021                 9258129691,75058544319499228106425512749435786...
dtype: object
This doesn't look very pretty!
Let's check out the column types:
tb_df.dtypes
U.S. jurisdiction    object
2019                 object
2020                 object
2021                 object
dtype: object
The commas within the counts (e.g., 1,234) cause pd.read_csv to read in the counts as the object datatype, or storage type. Strings are of the object datatype.
- So, pandas is concatenating strings (e.g., '1' + '2' = '12') instead of adding integers (e.g., 1 + 2 = 3).
Fortunately, read_csv also has a thousands parameter to handle exactly this issue (documentation).
- Note: This is not a fact that most data scientists would know off the top of their head! At this point, it would be very natural to Google/ask an LLM "How do I get pandas to ignore commas in numeric columns?", and then learn about the thousands parameter from the results.
tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
tb_df
U.S. jurisdiction | 2019 | 2020 | 2021 | |
---|---|---|---|---|
0 | Total | 8900 | 7173 | 7860 |
1 | Alabama | 87 | 72 | 92 |
2 | Alaska | 58 | 58 | 58 |
3 | Arizona | 183 | 136 | 129 |
4 | Arkansas | 64 | 59 | 69 |
... | ... | ... | ... | ... |
47 | Virginia | 191 | 169 | 161 |
48 | Washington | 221 | 163 | 199 |
49 | West Virginia | 9 | 13 | 7 |
50 | Wisconsin | 51 | 35 | 66 |
51 | Wyoming | 1 | 0 | 3 |
52 rows × 4 columns
Notice that there are no more commas in the first row!
Now, let's sum up the columns, ignoring the first row:
tb_df.drop(0).sum()
U.S. jurisdiction    AlabamaAlaskaArizonaArkansasCaliforniaColorado...
2019                                                             8900
2020                                                             7173
2021                                                             7860
dtype: object
Much better!
- Though you should note that string concatenation is still happening with the state names. To improve our code, we probably should not sum up the state name column. This exercise is left to you (one possible approach is sketched below)!
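One possible approach (a sketch, not the official solution) is to select only the numeric year columns before summing, so no string concatenation happens:
# Sum only the year columns, skipping the state-name column entirely
tb_df.drop(0)[["2019", "2020", "2021"]].sum()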
Finally, let's compare this output to the first row of the original data:
tb_df.head(1)
U.S. jurisdiction | 2019 | 2020 | 2021 | |
---|---|---|---|---|
0 | Total | 8900 | 7173 | 7860 |
The sums of the three TB cases columns are the same as the counts in the rollup record. Excellent!
Next, we will compute TB incidence for each state and the U.S. as a whole.
Instructor note: Return to the lecture!
🧺 Gather Census Data
Run the code in this section, but we won't review it during lecture.
- This section is a nice example of transforming data downloaded directly from a public website into a format that is convenient for analysis.
The code in this section transforms CSV files with U.S. Census population estimates (source (2010s), source (2020s)) into a form that is compatible with the TB case count data.
- We encourage you to explore the CSVs and study these lines outside of lecture.
There are a few new methods here:
- df.convert_dtypes() (documentation) conveniently converts all float dtypes into ints, and is out of scope for the class.
- df.dropna() (documentation) drops rows containing any NA value. This function will be explained in more detail in a future lecture.
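As a quick illustration (a toy example, not part of the lecture code), here is roughly what these two methods do:
# Toy DataFrame with a float column and some missing values
toy = pd.DataFrame({"area": ["A", "B", None], "pop": [1.0, 2.0, None]})
toy.convert_dtypes()  # "pop" becomes the nullable Int64 dtype: 1, 2, <NA>
toy.dropna()          # keeps only the rows with no NA values (here, the first two rows)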
census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
# Notice we have more than just state data!
display(census_2010s_df.head(10))
# Also notice that the bottom of the file includes metadata (data about data).
# We want to ignore this!
display(census_2010s_df.tail())
Unnamed: 0 | Census | Estimates Base | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | United States | 308745538.00000 | 308758105.00000 | 309321666.00000 | 311556874.00000 | 313830990.00000 | 315993715.00000 | 318301008.00000 | 320635163.00000 | 322941311.00000 | 324985539.00000 | 326687501.00000 | 328239523.00000 |
1 | Northeast | 55317240.00000 | 55318443.00000 | 55380134.00000 | 55604223.00000 | 55775216.00000 | 55901806.00000 | 56006011.00000 | 56034684.00000 | 56042330.00000 | 56059240.00000 | 56046620.00000 | 55982803.00000 |
2 | Midwest | 66927001.00000 | 66929725.00000 | 66974416.00000 | 67157800.00000 | 67336743.00000 | 67560379.00000 | 67745167.00000 | 67860583.00000 | 67987540.00000 | 68126781.00000 | 68236628.00000 | 68329004.00000 |
3 | South | 114555744.00000 | 114563030.00000 | 114866680.00000 | 116006522.00000 | 117241208.00000 | 118364400.00000 | 119624037.00000 | 120997341.00000 | 122351760.00000 | 123542189.00000 | 124569433.00000 | 125580448.00000 |
4 | West | 71945553.00000 | 71946907.00000 | 72100436.00000 | 72788329.00000 | 73477823.00000 | 74167130.00000 | 74925793.00000 | 75742555.00000 | 76559681.00000 | 77257329.00000 | 77834820.00000 | 78347268.00000 |
5 | .Alabama | 4779736.00000 | 4780125.00000 | 4785437.00000 | 4799069.00000 | 4815588.00000 | 4830081.00000 | 4841799.00000 | 4852347.00000 | 4863525.00000 | 4874486.00000 | 4887681.00000 | 4903185.00000 |
6 | .Alaska | 710231.00000 | 710249.00000 | 713910.00000 | 722128.00000 | 730443.00000 | 737068.00000 | 736283.00000 | 737498.00000 | 741456.00000 | 739700.00000 | 735139.00000 | 731545.00000 |
7 | .Arizona | 6392017.00000 | 6392288.00000 | 6407172.00000 | 6472643.00000 | 6554978.00000 | 6632764.00000 | 6730413.00000 | 6829676.00000 | 6941072.00000 | 7044008.00000 | 7158024.00000 | 7278717.00000 |
8 | .Arkansas | 2915918.00000 | 2916031.00000 | 2921964.00000 | 2940667.00000 | 2952164.00000 | 2959400.00000 | 2967392.00000 | 2978048.00000 | 2989918.00000 | 3001345.00000 | 3009733.00000 | 3017804.00000 |
9 | .California | 37253956.00000 | 37254519.00000 | 37319502.00000 | 37638369.00000 | 37948800.00000 | 38260787.00000 | 38596972.00000 | 38918045.00000 | 39167117.00000 | 39358497.00000 | 39461588.00000 | 39512223.00000 |
Unnamed: 0 | Census | Estimates Base | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
58 | Note: The estimates are based on the 2010 Cens... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
59 | Suggested Citation: | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
60 | Table 1. Annual Estimates of the Resident Popu... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
61 | Source: U.S. Census Bureau, Population Division | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
62 | Release Date: December 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Here we do a bit more basic data cleaning:
census_2010s_df = (
    census_2010s_df
    .rename(columns={"Unnamed: 0": "Geographic Area"})
    .drop(columns=["Census", "Estimates Base"])
    .convert_dtypes()  # "smart" converting of columns to int, use at your own risk
    .dropna()          # we'll introduce this very soon
)
census_2010s_df
Geographic Area | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | United States | 309321666 | 311556874 | 313830990 | 315993715 | 318301008 | 320635163 | 322941311 | 324985539 | 326687501 | 328239523 |
1 | Northeast | 55380134 | 55604223 | 55775216 | 55901806 | 56006011 | 56034684 | 56042330 | 56059240 | 56046620 | 55982803 |
2 | Midwest | 66974416 | 67157800 | 67336743 | 67560379 | 67745167 | 67860583 | 67987540 | 68126781 | 68236628 | 68329004 |
3 | South | 114866680 | 116006522 | 117241208 | 118364400 | 119624037 | 120997341 | 122351760 | 123542189 | 124569433 | 125580448 |
4 | West | 72100436 | 72788329 | 73477823 | 74167130 | 74925793 | 75742555 | 76559681 | 77257329 | 77834820 | 78347268 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
52 | .Washington | 6742830 | 6826627 | 6897058 | 6963985 | 7054655 | 7163657 | 7294771 | 7423362 | 7523869 | 7614893 |
53 | .West Virginia | 1854239 | 1856301 | 1856872 | 1853914 | 1849489 | 1842050 | 1831023 | 1817004 | 1804291 | 1792147 |
54 | .Wisconsin | 5690475 | 5705288 | 5719960 | 5736754 | 5751525 | 5760940 | 5772628 | 5790186 | 5807406 | 5822434 |
55 | .Wyoming | 564487 | 567299 | 576305 | 582122 | 582531 | 585613 | 584215 | 578931 | 577601 | 578759 |
57 | Puerto Rico | 3721525 | 3678732 | 3634488 | 3593077 | 3534874 | 3473232 | 3406672 | 3325286 | 3193354 | 3193694 |
57 rows × 11 columns
You might ask yourself: What is the granularity of each row in this table?
Notice there is a '.' at the beginning of all the states. We need to remove that.
census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
census_2010s_df
Geographic Area | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | United States | 309321666 | 311556874 | 313830990 | 315993715 | 318301008 | 320635163 | 322941311 | 324985539 | 326687501 | 328239523 |
1 | Northeast | 55380134 | 55604223 | 55775216 | 55901806 | 56006011 | 56034684 | 56042330 | 56059240 | 56046620 | 55982803 |
2 | Midwest | 66974416 | 67157800 | 67336743 | 67560379 | 67745167 | 67860583 | 67987540 | 68126781 | 68236628 | 68329004 |
3 | South | 114866680 | 116006522 | 117241208 | 118364400 | 119624037 | 120997341 | 122351760 | 123542189 | 124569433 | 125580448 |
4 | West | 72100436 | 72788329 | 73477823 | 74167130 | 74925793 | 75742555 | 76559681 | 77257329 | 77834820 | 78347268 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
52 | Washington | 6742830 | 6826627 | 6897058 | 6963985 | 7054655 | 7163657 | 7294771 | 7423362 | 7523869 | 7614893 |
53 | West Virginia | 1854239 | 1856301 | 1856872 | 1853914 | 1849489 | 1842050 | 1831023 | 1817004 | 1804291 | 1792147 |
54 | Wisconsin | 5690475 | 5705288 | 5719960 | 5736754 | 5751525 | 5760940 | 5772628 | 5790186 | 5807406 | 5822434 |
55 | Wyoming | 564487 | 567299 | 576305 | 582122 | 582531 | 585613 | 584215 | 578931 | 577601 | 578759 |
57 | Puerto Rico | 3721525 | 3678732 | 3634488 | 3593077 | 3534874 | 3473232 | 3406672 | 3325286 | 3193354 | 3193694 |
57 rows × 11 columns
The 2020s data is in a separate file, so we will repeat the same data cleaning process on the 2020s dataset.
- Even better, we could write a reusable function to carry out a similar cleaning process for both datasets (a sketch of such a function is shown below). For this demo, we will use the same code twice.
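Here is a minimal sketch of what such a helper could look like (clean_census_csv is a hypothetical name, and this function is not actually used in the demo):
def clean_census_csv(path, first_year, last_year):
    """Read a census population CSV and keep the area name plus the given year columns."""
    df = pd.read_csv(path, header=3, thousands=",")
    df = df.rename(columns={"Unnamed: 0": "Geographic Area"})
    # Selecting columns by name also discards the extra/blank columns we don't need
    year_cols = [str(year) for year in range(first_year, last_year + 1)]
    df = df[["Geographic Area"] + year_cols].convert_dtypes().dropna()
    df["Geographic Area"] = df["Geographic Area"].str.strip(".")
    return df

# Hypothetical usage:
# census_2010s_df = clean_census_csv("data/nst-est2019-01.csv", 2010, 2019)
# census_2020s_df = clean_census_csv("data/NST-EST2024-POP.csv", 2020, 2024)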
# census 2020s data
census_2020s_df = pd.read_csv("data/NST-EST2024-POP.csv", header=3, thousands=",")
# Once again, we have more than just state data, and metadata at the bottom.
# But, we also have a ton of extra blank columns!
display(census_2020s_df.head(10))
display(census_2020s_df.tail())
Unnamed: 0 | Unnamed: 1 | 2020 | 2021 | 2022 | 2023 | 2024 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | Unnamed: 27 | Unnamed: 28 | Unnamed: 29 | Unnamed: 30 | Unnamed: 31 | Unnamed: 32 | Unnamed: 33 | Unnamed: 34 | Unnamed: 35 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | United States | 331515736.00000 | 331577720.00000 | 332099760.00000 | 334017321.00000 | 336806231.00000 | 340110988.00000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Northeast | 57617706.00000 | 57431458.00000 | 57252533.00000 | 57159597.00000 | 57398303.00000 | 57832935.00000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Midwest | 68998970.00000 | 68984258.00000 | 68872831.00000 | 68903297.00000 | 69186401.00000 | 69596584.00000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | South | 126281537.00000 | 126476549.00000 | 127368010.00000 | 129037849.00000 | 130893358.00000 | 132665693.00000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | West | 78617523.00000 | 78685455.00000 | 78606386.00000 | 78916578.00000 | 79328169.00000 | 80015776.00000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | .Alabama | 5025369.00000 | 5033094.00000 | 5049196.00000 | 5076181.00000 | 5117673.00000 | 5157699.00000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | .Alaska | 733395.00000 | 733017.00000 | 734420.00000 | 734442.00000 | 736510.00000 | 740133.00000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | .Arizona | 7158110.00000 | 7187135.00000 | 7274078.00000 | 7377566.00000 | 7473027.00000 | 7582384.00000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | .Arkansas | 3011553.00000 | 3014546.00000 | 3026870.00000 | 3047704.00000 | 3069463.00000 | 3088354.00000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9 | .California | 39555674.00000 | 39521958.00000 | 39142565.00000 | 39142414.00000 | 39198693.00000 | 39431263.00000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Unnamed: 0 | Unnamed: 1 | 2020 | 2021 | 2022 | 2023 | 2024 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | Unnamed: 27 | Unnamed: 28 | Unnamed: 29 | Unnamed: 30 | Unnamed: 31 | Unnamed: 32 | Unnamed: 33 | Unnamed: 34 | Unnamed: 35 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
59 | Note: The estimates are developed from a base ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
60 | Suggested Citation: | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
61 | Annual Estimates of the Resident Population fo... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
62 | Source: U.S. Census Bureau, Population Division | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
63 | Release Date: December 2024 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# census 2020s data
census_2020s_df = (
    census_2020s_df
    .drop(columns=["Unnamed: 1"])
    .rename(columns={"Unnamed: 0": "Geographic Area"})
    # ignore all the blank extra columns
    .loc[:, "Geographic Area":"2024"]
    .convert_dtypes()
    .dropna()
)
census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
census_2020s_df
Geographic Area | 2020 | 2021 | 2022 | 2023 | 2024 | |
---|---|---|---|---|---|---|
0 | United States | 331577720 | 332099760 | 334017321 | 336806231 | 340110988 |
1 | Northeast | 57431458 | 57252533 | 57159597 | 57398303 | 57832935 |
2 | Midwest | 68984258 | 68872831 | 68903297 | 69186401 | 69596584 |
3 | South | 126476549 | 127368010 | 129037849 | 130893358 | 132665693 |
4 | West | 78685455 | 78606386 | 78916578 | 79328169 | 80015776 |
... | ... | ... | ... | ... | ... | ... |
52 | Washington | 7727209 | 7743760 | 7794123 | 7857320 | 7958180 |
53 | West Virginia | 1791646 | 1785618 | 1774122 | 1770495 | 1769979 |
54 | Wisconsin | 5897375 | 5881608 | 5903975 | 5930405 | 5960975 |
55 | Wyoming | 577681 | 579636 | 581978 | 585067 | 587618 |
57 | Puerto Rico | 3281590 | 3262711 | 3220137 | 3203792 | 3203295 |
57 rows × 6 columns
With that, we're in business!
We now have U.S. Census data from 2019, 2020, and 2021 in a format that is compatible with our TB case count data.
Time to merge our datasets (i.e., join them)!
# Show the three tables that we are going to join.
# To keep things simple, let's just look at the last two rows of each df.
display(tb_df.tail(2))
display(census_2010s_df.tail(2))
display(census_2020s_df.tail(2))
U.S. jurisdiction | 2019 | 2020 | 2021 | |
---|---|---|---|---|
50 | Wisconsin | 51 | 35 | 66 |
51 | Wyoming | 1 | 0 | 3 |
Geographic Area | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|---|
55 | Wyoming | 564487 | 567299 | 576305 | 582122 | 582531 | 585613 | 584215 | 578931 | 577601 | 578759 |
57 | Puerto Rico | 3721525 | 3678732 | 3634488 | 3593077 | 3534874 | 3473232 | 3406672 | 3325286 | 3193354 | 3193694 |
Geographic Area | 2020 | 2021 | 2022 | 2023 | 2024 | |
---|---|---|---|---|---|---|
55 | Wyoming | 577681 | 579636 | 581978 | 585067 | 587618 |
57 | Puerto Rico | 3281590 | 3262711 | 3220137 | 3203792 | 3203295 |
We're only interested in the population for the years 2019, 2020, and 2021, so let's select just those columns:
# Select only the relevant population years
census_2019_df = census_2010s_df[['Geographic Area', '2019']]
census_2020_2021_df = census_2020s_df[['Geographic Area', '2020', '2021']]
display(tb_df.tail(2))
display(census_2019_df.tail(2))
display(census_2020_2021_df.tail(2))
U.S. jurisdiction | 2019 | 2020 | 2021 | |
---|---|---|---|---|
50 | Wisconsin | 51 | 35 | 66 |
51 | Wyoming | 1 | 0 | 3 |
Geographic Area | 2019 | |
---|---|---|
55 | Wyoming | 578759 |
57 | Puerto Rico | 3193694 |
Geographic Area | 2020 | 2021 | |
---|---|---|---|
55 | Wyoming | 577681 | 579636 |
57 | Puerto Rico | 3281590 | 3262711 |
All three dataframes have a column containing U.S. states, along with some other geographic areas. These columns are our join keys.
- Below, we use df1.merge(right=df2, ...) to carry out the merge (documentation).
- We could have alternatively used the function pd.merge(left=df1, right=df2, ...) (documentation).
# merge TB dataframe with two US census dataframes
tb_census_df = (
    tb_df
    .merge(right=census_2019_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area")
    .merge(right=census_2020_2021_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area")
)
tb_census_df.tail(2)
U.S. jurisdiction | 2019_x | 2020_x | 2021_x | Geographic Area_x | 2019_y | Geographic Area_y | 2020_y | 2021_y | |
---|---|---|---|---|---|---|---|---|---|
49 | Wisconsin | 51 | 35 | 66 | Wisconsin | 5822434 | Wisconsin | 5897375 | 5881608 |
50 | Wyoming | 1 | 0 | 3 | Wyoming | 578759 | Wyoming | 577681 | 579636 |
To see what's going on with the duplicate columns, and the _x and _y suffixes, let's do just the first merge:
tb_df.merge(right=census_2019_df,
            left_on="U.S. jurisdiction",
            right_on="Geographic Area").head()
U.S. jurisdiction | 2019_x | 2020 | 2021 | Geographic Area | 2019_y | |
---|---|---|---|---|---|---|
0 | Alabama | 87 | 72 | 92 | Alabama | 4903185 |
1 | Alaska | 58 | 58 | 58 | Alaska | 731545 |
2 | Arizona | 183 | 136 | 129 | Arizona | 7278717 |
3 | Arkansas | 64 | 59 | 69 | Arkansas | 3017804 |
4 | California | 2111 | 1706 | 1750 | California | 39512223 |
Notice that the columns containing the join keys have all been retained, and all contain the same values.
- Furthermore, notice that the duplicated columns are appended with _x and _y to keep the column names unique.
- In the TB case count data, column 2019 represents the number of TB cases in 2019, but in the Census data, column 2019 represents the U.S. population.
We can use the suffixes argument to modify the _x and _y defaults to our liking (documentation).
# Specify the suffixes to use for duplicated column names
tb_df.merge(right=census_2019_df,
            left_on="U.S. jurisdiction",
            right_on="Geographic Area",
            suffixes=('_cases', '_population')).head()
U.S. jurisdiction | 2019_cases | 2020 | 2021 | Geographic Area | 2019_population | |
---|---|---|---|---|---|---|
0 | Alabama | 87 | 72 | 92 | Alabama | 4903185 |
1 | Alaska | 58 | 58 | 58 | Alaska | 731545 |
2 | Arizona | 183 | 136 | 129 | Arizona | 7278717 |
3 | Arkansas | 64 | 59 | 69 | Arkansas | 3017804 |
4 | California | 2111 | 1706 | 1750 | California | 39512223 |
Notice the _x and _y have changed to _cases and _population, just like we specified.
Putting it all together, and dropping the duplicated Geographic Area columns:
# Redux: merge TB dataframe with two US census dataframes
tb_census_df = (
    tb_df
    .merge(right=census_2019_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area",
           suffixes=('_cases', '_population'))
    .drop(columns="Geographic Area")
    .merge(right=census_2020_2021_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area",
           suffixes=('_cases', '_population'))
    .drop(columns="Geographic Area")
)
tb_census_df.tail(2)
U.S. jurisdiction | 2019_cases | 2020_cases | 2021_cases | 2019_population | 2020_population | 2021_population | |
---|---|---|---|---|---|---|---|
49 | Wisconsin | 51 | 35 | 66 | 5822434 | 5897375 | 5881608 |
50 | Wyoming | 1 | 0 | 3 | 578759 | 577681 | 579636 |
♻️ Reproduce incidence
Let's see if we can reproduce the original CDC numbers from our augmented dataset of TB case counts and state populations.
Recall that the nationwide TB incidence was 2.7 in 2019, 2.2 in 2020, and 2.4 in 2021.
Along the way, we'll also compute state-level incidence.
From the CDC report: TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
Let's start with a simpler question: What is the per person incidence?
- In other words, what is the probability that a randomly selected person in the population had TB within a given year?
$$\text{TB incidence per person} = \frac{\text{\# TB cases in population}}{\text{Total population size}}$$
Let's calculate per person incidence for 2019:
# Calculate per person incidence for 2019
tb_census_df["per person incidence 2019"] = (
    tb_census_df["2019_cases"] / tb_census_df["2019_population"]
)
tb_census_df
U.S. jurisdiction | 2019_cases | 2020_cases | 2021_cases | 2019_population | 2020_population | 2021_population | per person incidence 2019 | |
---|---|---|---|---|---|---|---|---|
0 | Alabama | 87 | 72 | 92 | 4903185 | 5033094 | 5049196 | 0.00002 |
1 | Alaska | 58 | 58 | 58 | 731545 | 733017 | 734420 | 0.00008 |
2 | Arizona | 183 | 136 | 129 | 7278717 | 7187135 | 7274078 | 0.00003 |
3 | Arkansas | 64 | 59 | 69 | 3017804 | 3014546 | 3026870 | 0.00002 |
4 | California | 2111 | 1706 | 1750 | 39512223 | 39521958 | 39142565 | 0.00005 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
46 | Virginia | 191 | 169 | 161 | 8535519 | 8637615 | 8658910 | 0.00002 |
47 | Washington | 221 | 163 | 199 | 7614893 | 7727209 | 7743760 | 0.00003 |
48 | West Virginia | 9 | 13 | 7 | 1792147 | 1791646 | 1785618 | 0.00001 |
49 | Wisconsin | 51 | 35 | 66 | 5822434 | 5897375 | 5881608 | 0.00001 |
50 | Wyoming | 1 | 0 | 3 | 578759 | 577681 | 579636 | 0.00000 |
51 rows × 8 columns
TB is really rare in the United States, so per person TB incidence is really low, as expected.
But, if we were to consider a group of 100,000 people, the expected number of TB cases in that group would be higher.
In fact, it would be 100,000 times higher!
$$\text{TB incidence per 100,000} = 100{,}000 \times \text{TB incidence per person}$$
# To help read bigger numbers in Python, you can use _ to separate thousands,
# akin to using commas. 100_000 is the same as writing 100000, but more readable.
tb_census_df["per 100k incidence 2019"] = (
    100_000 * tb_census_df["per person incidence 2019"]
)
tb_census_df
U.S. jurisdiction | 2019_cases | 2020_cases | 2021_cases | 2019_population | 2020_population | 2021_population | per person incidence 2019 | per 100k incidence 2019 | |
---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 87 | 72 | 92 | 4903185 | 5033094 | 5049196 | 0.00002 | 1.77436 |
1 | Alaska | 58 | 58 | 58 | 731545 | 733017 | 734420 | 0.00008 | 7.92843 |
2 | Arizona | 183 | 136 | 129 | 7278717 | 7187135 | 7274078 | 0.00003 | 2.51418 |
3 | Arkansas | 64 | 59 | 69 | 3017804 | 3014546 | 3026870 | 0.00002 | 2.12075 |
4 | California | 2111 | 1706 | 1750 | 39512223 | 39521958 | 39142565 | 0.00005 | 5.34265 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
46 | Virginia | 191 | 169 | 161 | 8535519 | 8637615 | 8658910 | 0.00002 | 2.23771 |
47 | Washington | 221 | 163 | 199 | 7614893 | 7727209 | 7743760 | 0.00003 | 2.90221 |
48 | West Virginia | 9 | 13 | 7 | 1792147 | 1791646 | 1785618 | 0.00001 | 0.50219 |
49 | Wisconsin | 51 | 35 | 66 | 5822434 | 5897375 | 5881608 | 0.00001 | 0.87592 |
50 | Wyoming | 1 | 0 | 3 | 578759 | 577681 | 579636 | 0.00000 | 0.17278 |
51 rows × 9 columns
Now we're seeing more human-readable values.
- For example, there were 5.3 tuberculosis cases for every 100,000 California residents in 2019.
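As a quick sanity check (a sketch, not part of the original notebook), we can pull that value out of the dataframe directly:
# Look up California's 2019 rate; it should be about 5.34 cases per 100,000 residents
is_california = tb_census_df["U.S. jurisdiction"] == "California"
tb_census_df.loc[is_california, "per 100k incidence 2019"]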
To wrap up this exercise, let's calculate the nationwide incidence of TB in 2019.
# Recall that the CDC reported an incidence of 2.7 per 100,000 in 2019.
tot_tb_cases_50_states = tb_census_df["2019_cases"].sum()
tot_pop_50_states = tb_census_df["2019_population"].sum()
tb_per_100k_50_states = 100_000 * tot_tb_cases_50_states / tot_pop_50_states
tb_per_100k_50_states
2.7114346007625656
We can use a for loop to compute the incidence for 2019, 2020, and 2021.
- You'll notice that we get the same numbers reported by the CDC!
# f strings (f"...") are a handy way to pass in variables to strings.
for year in [2019, 2020, 2021]:
    tot_tb_cases_50_states = tb_census_df[f"{year}_cases"].sum()
    tot_pop_50_states = tb_census_df[f"{year}_population"].sum()
    tb_per_100k_50_states = 100_000 * tot_tb_cases_50_states / tot_pop_50_states
    print(tb_per_100k_50_states)
2.7114346007625656
2.163293721906285
2.366758711298075
Instructor Note: Return to Slides!