👩‍⚕️ Lecture 5 (Part 1, Tuberculosis) – Data 100, Spring 2025¶

Data 100, Spring 2025

Acknowledgments Page

In [1]:
import numpy as np
import pandas as pd
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
# Use 5 decimal places instead of scientific notation in pandas
pd.set_option('display.float_format', '{:.5f}'.format)

# Silence some spurious seaborn warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

🦠 Tuberculosis in the United States¶

What can we say about the presence of Tuberculosis in the United States?

Let's work with the data included in the original CDC article published in 2021.



📖 Reading CSVs¶

The TB case count data is saved as a CSV file located at data/cdc_tuberculosis.csv.

We can explore the CSV file in many ways:

  1. Using the JupyterLab explorer tool (read-only!).
  2. Opening the CSV in DataHub, Excel, Google Sheets, etc.
  3. Inspecting the Python file object.
  4. With pandas, using pd.read_csv().


🧭 Methods 1 and 2: Play with the data in the JupyterLab Explorer and DataHub¶

To solidify the idea of a CSV as rectangular data (i.e., tabular data) stored as comma-separated values, let's start with the first two methods.

1. Use the file browser in JupyterLab to locate the CSV at data/cdc_tuberculosis.csv, and double-click on it.

2. Right-click on the CSV in the file browser. Select Open With --> Editor. But, don't make any changes to the file!



🐍 Method 3: Play with the data in Python¶

Next, we will load the data as a Python file object and inspect a couple of lines.

With the code below, we can check out the first four lines of the CSV:

In [3]:
# Open the TB case count CSV, and print the first four lines
with open("data/cdc_tuberculosis.csv", "r") as f:
    for i, row in enumerate(f):
        print(row)
        if i >= 3: break
,No. of TB cases,,

U.S. jurisdiction,2019,2020,2021

Total,"8,900","7,173","7,860"

Alabama,87,72,92

As expected, we have four lines of comma-separated values!

Why are there blank lines between each line of the CSV file?

You may recall that line breaks in text files are encoded with the special newline character \n.

Python's print() function renders the \n at the end of each row as a line break, and then adds an extra newline of its own, which is why we see blank lines between rows.

We can use the repr() ("representation") function to return the raw string representation of each line (i.e., all special characters will be visible).

  • In other words, the newline at the end of each row will be shown as the literal characters \n rather than being rendered as a line break.

  • Note, though, that print() still adds a newline each time it is called. Otherwise, we would see one long string below instead of four lines.

In [4]:
# Open the TB case count CSV, and print the raw representation of
# the first four lines
with open("data/cdc_tuberculosis.csv", "r") as f:
    for i, row in enumerate(f):
        print(repr(row)) # print raw strings
        if i >= 3: break
',No. of TB cases,,\n'
'U.S. jurisdiction,2019,2020,2021\n'
'Total,"8,900","7,173","7,860"\n'
'Alabama,87,72,92\n'
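
As an aside, if we just wanted to suppress the extra blank lines, we could tell print() not to add its own newline, since each row already ends with one. A minimal sketch:

# Sketch: print the first four lines without the extra blank lines.
# end="" tells print() not to append its own newline;
# each row string already ends with "\n".
with open("data/cdc_tuberculosis.csv", "r") as f:
    for i, row in enumerate(f):
        print(row, end="")
        if i >= 3: break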



🐼 Method 4: Play with the data using pandas¶

It's time for the tried-and-true Data 100 approach: pandas.

In [5]:
tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
tb_df
Out[5]:
Unnamed: 0 No. of TB cases Unnamed: 2 Unnamed: 3
0 U.S. jurisdiction 2019 2020 2021
1 Total 8,900 7,173 7,860
2 Alabama 87 72 92
3 Alaska 58 58 58
4 Arizona 183 136 129
... ... ... ... ...
48 Virginia 191 169 161
49 Washington 221 163 199
50 West Virginia 9 13 7
51 Wisconsin 51 35 66
52 Wyoming 1 0 3

53 rows × 4 columns

What's going on with the "Unnamed" column names?

And why does the first row look different than the other rows?

We're ready to wrangle the data!

A reasonable first step is to identify the row with the right header (i.e., the row with the column names).

The pd.read_csv() function (documentation) has a convenient header parameter for specifying the index of the row you want to use as the header:

In [6]:
# header=1 tells pandas to ignore row 0, and use row 1 as the column names
tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1)
tb_df
Out[6]:
U.S. jurisdiction 2019 2020 2021
0 Total 8,900 7,173 7,860
1 Alabama 87 72 92
2 Alaska 58 58 58
3 Arizona 183 136 129
4 Arkansas 64 59 69
... ... ... ... ...
47 Virginia 191 169 161
48 Washington 221 163 199
49 West Virginia 9 13 7
50 Wisconsin 51 35 66
51 Wyoming 1 0 3

52 rows × 4 columns

Notice that we no longer have "Unnamed" columns.


Instructor note: Return to slides!








🔎 Granularity of records¶

Notice that the first record (i.e., row 0) differs from the other records:

In [7]:
tb_df.head()
Out[7]:
U.S. jurisdiction 2019 2020 2021
0 Total 8,900 7,173 7,860
1 Alabama 87 72 92
2 Alaska 58 58 58
3 Arizona 183 136 129
4 Arkansas 64 59 69

Row 0 is what we call a rollup record, or a summary record.

  • The granularity of record 0 (i.e., the total counts summed over all states) differs from the granularity of the rest of the records (i.e., the counts for individual states).

  • Rollup records are often useful when displaying tables to humans. But, rollup records are generally less useful when using the data for further analysis, since the rollup record "overlaps" with other records (i.e., info from other rows is aggregated to create the rollup record).


Okay, EDA step two. How was the rollup record aggregated?

Let's check if total TB cases (row 0) is indeed the sum of all state TB cases (all other rows).

  • To do this, we can drop row 0, and sum up all the remaining rows.
In [8]:
tb_df.drop(0)
Out[8]:
U.S. jurisdiction 2019 2020 2021
1 Alabama 87 72 92
2 Alaska 58 58 58
3 Arizona 183 136 129
4 Arkansas 64 59 69
5 California 2,111 1,706 1,750
... ... ... ... ...
47 Virginia 191 169 161
48 Washington 221 163 199
49 West Virginia 9 13 7
50 Wisconsin 51 35 66
51 Wyoming 1 0 3

51 rows × 4 columns

In [9]:
tb_df.drop(0).sum()
Out[9]:
U.S. jurisdiction    AlabamaAlaskaArizonaArkansasCaliforniaColorado...
2019                 8758183642,11166671824558302997326108523766881...
2020                 7258136591,70652541719412221928216923937679917...
2021                 9258129691,75058544319499228106425512749435786...
dtype: object


This doesn't look very pretty!

Let's check out the column types:

In [10]:
tb_df.dtypes
Out[10]:
U.S. jurisdiction    object
2019                 object
2020                 object
2021                 object
dtype: object


The commas within the counts (e.g., 1,234) cause pd.read_csv to read in the counts as the object datatype, the storage type pandas uses for strings.

  • So, pandas is concatenating strings (e.g., '1' + '2' = '12') instead of adding integers (e.g., 1 + 2 = 3).


Fortunately, read_csv also has a thousands parameter to handle exactly this issue (documentation).

  • Note: This is not a fact that most data scientists would know off the top of their head! At this point, it would be very natural to Google or ask an LLM "How do I get pandas to ignore commas in numeric columns?", and then learn about the thousands parameter from the results.
In [11]:
tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
tb_df
Out[11]:
U.S. jurisdiction 2019 2020 2021
0 Total 8900 7173 7860
1 Alabama 87 72 92
2 Alaska 58 58 58
3 Arizona 183 136 129
4 Arkansas 64 59 69
... ... ... ... ...
47 Virginia 191 169 161
48 Washington 221 163 199
49 West Virginia 9 13 7
50 Wisconsin 51 35 66
51 Wyoming 1 0 3

52 rows × 4 columns

Notice that there are no more commas in the first row!
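
As an aside, if the thousands parameter didn't exist, we could have achieved the same result after loading by cleaning the strings ourselves. A rough sketch:

# Sketch: manually strip the commas and convert to integers,
# equivalent to what thousands=',' did for us above.
messy_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1)
for col in ["2019", "2020", "2021"]:
    messy_df[col] = messy_df[col].str.replace(",", "").astype(int)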

Now, let's sum up the columns, ignoring the first row:

In [12]:
tb_df.drop(0).sum()
Out[12]:
U.S. jurisdiction    AlabamaAlaskaArizonaArkansasCaliforniaColorado...
2019                                                              8900
2020                                                              7173
2021                                                              7860
dtype: object

Much better!

  • Though you should note that string concatenation is still happening with the state names. To improve our code, we probably should not sum up the state name column. This exercise is left to you (one possible approach is sketched just below)!
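
A minimal sketch, assuming tb_df was read with thousands=',' as above:

# Sketch: avoid concatenating state names by summing only the numeric columns.
tb_df.drop(0).sum(numeric_only=True)

# Equivalently, drop the jurisdiction column before summing:
# tb_df.drop(0).drop(columns="U.S. jurisdiction").sum()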

Finally, let's compare this output to the first row of the original data:

In [13]:
tb_df.head(1)
Out[13]:
U.S. jurisdiction 2019 2020 2021
0 Total 8900 7173 7860

The sums of the three TB cases columns are the same as the counts in the rollup record. Excellent!

Next, we will compute TB incidence for each state and the U.S. as a whole.

Instructor note: Return to the lecture!






🧺 Gather Census Data¶

Run the code in this section, but we won't review it during lecture.

  • This section is a nice example of transforming data downloaded directly from a public website into a format that is convenient for analysis.

The code in this section transforms CSV files with U.S. Census population estimates (source (2010s), source (2020s)) into a form that is compatible with the TB case count data.

  • We encourage you to explore the CSVs and study these lines outside of lecture.

There are a few new methods here:

  • df.convert_dtypes() (documentation) conveniently converts each column to the best available dtype (e.g., float columns holding whole numbers become integer columns); the details are out of scope for this class.
  • df.dropna() (documentation) drops rows containing any NA value. This method will be explained in more detail in a future lecture.
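
Here is a toy illustration of both methods on a small, made-up DataFrame (hypothetical data, not part of the census files):

# Toy illustration on hypothetical data:
toy = pd.DataFrame({"state": ["A", "B", None], "pop": [1.0, 2.0, 3.0]})
toy.convert_dtypes()   # "pop" becomes a (nullable) integer column; "state" becomes a string column
toy.dropna()           # keeps only rows with no missing values (drops the row where "state" is None)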
In [14]:
census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")

# Notice we have more than just state data!
display(census_2010s_df.head(10))

# Also notice that the bottom of the file includes metadata (data about data).
# We want to ignore this!
display(census_2010s_df.tail())
Unnamed: 0 Census Estimates Base 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 United States 308745538.00000 308758105.00000 309321666.00000 311556874.00000 313830990.00000 315993715.00000 318301008.00000 320635163.00000 322941311.00000 324985539.00000 326687501.00000 328239523.00000
1 Northeast 55317240.00000 55318443.00000 55380134.00000 55604223.00000 55775216.00000 55901806.00000 56006011.00000 56034684.00000 56042330.00000 56059240.00000 56046620.00000 55982803.00000
2 Midwest 66927001.00000 66929725.00000 66974416.00000 67157800.00000 67336743.00000 67560379.00000 67745167.00000 67860583.00000 67987540.00000 68126781.00000 68236628.00000 68329004.00000
3 South 114555744.00000 114563030.00000 114866680.00000 116006522.00000 117241208.00000 118364400.00000 119624037.00000 120997341.00000 122351760.00000 123542189.00000 124569433.00000 125580448.00000
4 West 71945553.00000 71946907.00000 72100436.00000 72788329.00000 73477823.00000 74167130.00000 74925793.00000 75742555.00000 76559681.00000 77257329.00000 77834820.00000 78347268.00000
5 .Alabama 4779736.00000 4780125.00000 4785437.00000 4799069.00000 4815588.00000 4830081.00000 4841799.00000 4852347.00000 4863525.00000 4874486.00000 4887681.00000 4903185.00000
6 .Alaska 710231.00000 710249.00000 713910.00000 722128.00000 730443.00000 737068.00000 736283.00000 737498.00000 741456.00000 739700.00000 735139.00000 731545.00000
7 .Arizona 6392017.00000 6392288.00000 6407172.00000 6472643.00000 6554978.00000 6632764.00000 6730413.00000 6829676.00000 6941072.00000 7044008.00000 7158024.00000 7278717.00000
8 .Arkansas 2915918.00000 2916031.00000 2921964.00000 2940667.00000 2952164.00000 2959400.00000 2967392.00000 2978048.00000 2989918.00000 3001345.00000 3009733.00000 3017804.00000
9 .California 37253956.00000 37254519.00000 37319502.00000 37638369.00000 37948800.00000 38260787.00000 38596972.00000 38918045.00000 39167117.00000 39358497.00000 39461588.00000 39512223.00000
Unnamed: 0 Census Estimates Base 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
58 Note: The estimates are based on the 2010 Cens... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
59 Suggested Citation: NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
60 Table 1. Annual Estimates of the Resident Popu... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
61 Source: U.S. Census Bureau, Population Division NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
62 Release Date: December 2019 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Here we do a bit more basic data cleaning:

In [15]:
census_2010s_df = (
    census_2010s_df
    .rename(columns={"Unnamed: 0": "Geographic Area"})
    .drop(columns=["Census", "Estimates Base"])
    .convert_dtypes() # "smart" converting of columns to int, use at your own risk
    .dropna()  # we'll introduce this very soon
)
census_2010s_df
Out[15]:
Geographic Area 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 United States 309321666 311556874 313830990 315993715 318301008 320635163 322941311 324985539 326687501 328239523
1 Northeast 55380134 55604223 55775216 55901806 56006011 56034684 56042330 56059240 56046620 55982803
2 Midwest 66974416 67157800 67336743 67560379 67745167 67860583 67987540 68126781 68236628 68329004
3 South 114866680 116006522 117241208 118364400 119624037 120997341 122351760 123542189 124569433 125580448
4 West 72100436 72788329 73477823 74167130 74925793 75742555 76559681 77257329 77834820 78347268
... ... ... ... ... ... ... ... ... ... ... ...
52 .Washington 6742830 6826627 6897058 6963985 7054655 7163657 7294771 7423362 7523869 7614893
53 .West Virginia 1854239 1856301 1856872 1853914 1849489 1842050 1831023 1817004 1804291 1792147
54 .Wisconsin 5690475 5705288 5719960 5736754 5751525 5760940 5772628 5790186 5807406 5822434
55 .Wyoming 564487 567299 576305 582122 582531 585613 584215 578931 577601 578759
57 Puerto Rico 3721525 3678732 3634488 3593077 3534874 3473232 3406672 3325286 3193354 3193694

57 rows × 11 columns

You might ask yourself: What is the granularity of each row in this table?

Notice there is a '.' at the beginning of all the state names. We need to remove it.

In [16]:
census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
census_2010s_df
Out[16]:
Geographic Area 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 United States 309321666 311556874 313830990 315993715 318301008 320635163 322941311 324985539 326687501 328239523
1 Northeast 55380134 55604223 55775216 55901806 56006011 56034684 56042330 56059240 56046620 55982803
2 Midwest 66974416 67157800 67336743 67560379 67745167 67860583 67987540 68126781 68236628 68329004
3 South 114866680 116006522 117241208 118364400 119624037 120997341 122351760 123542189 124569433 125580448
4 West 72100436 72788329 73477823 74167130 74925793 75742555 76559681 77257329 77834820 78347268
... ... ... ... ... ... ... ... ... ... ... ...
52 Washington 6742830 6826627 6897058 6963985 7054655 7163657 7294771 7423362 7523869 7614893
53 West Virginia 1854239 1856301 1856872 1853914 1849489 1842050 1831023 1817004 1804291 1792147
54 Wisconsin 5690475 5705288 5719960 5736754 5751525 5760940 5772628 5790186 5807406 5822434
55 Wyoming 564487 567299 576305 582122 582531 585613 584215 578931 577601 578759
57 Puerto Rico 3721525 3678732 3634488 3593077 3534874 3473232 3406672 3325286 3193354 3193694

57 rows × 11 columns

The 2020s data is in a separate file.

So, we will repeat the same data cleaning process on the 2020s dataset.

  • Even better, we could write a reusable function to carry out a similar cleaning process for both datasets (a sketch of such a helper follows this bullet). For this demo, we will simply use the same code twice.
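
For reference, such a helper might look something like the sketch below. The function name and parameters are hypothetical, and the rest of the demo does not use it:

# Sketch of a reusable cleaning helper (hypothetical; the demo repeats the code instead).
def clean_census_csv(path, drop_cols=(), last_col=None):
    df = (
        pd.read_csv(path, header=3, thousands=",")
        .rename(columns={"Unnamed: 0": "Geographic Area"})
        .drop(columns=list(drop_cols))
    )
    if last_col is not None:
        # keep only the columns up through last_col (ignores the blank extra columns)
        df = df.loc[:, "Geographic Area":last_col]
    df = df.convert_dtypes().dropna()
    df["Geographic Area"] = df["Geographic Area"].str.strip(".")
    return df

# Hypothetical usage:
# census_2010s_df = clean_census_csv("data/nst-est2019-01.csv",
#                                    drop_cols=["Census", "Estimates Base"])
# census_2020s_df = clean_census_csv("data/NST-EST2024-POP.csv",
#                                    drop_cols=["Unnamed: 1"], last_col="2024")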
In [17]:
# census 2020s data
census_2020s_df = pd.read_csv("data/NST-EST2024-POP.csv", header=3, thousands=",")

# Once again, we have more than just state data, and metadata at the bottom.
# But, we also have a ton of extra blank columns!
display(census_2020s_df.head(10))
display(census_2020s_df.tail())
Unnamed: 0 Unnamed: 1 2020 2021 2022 2023 2024 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Unnamed: 15 Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20 Unnamed: 21 Unnamed: 22 Unnamed: 23 Unnamed: 24 Unnamed: 25 Unnamed: 26 Unnamed: 27 Unnamed: 28 Unnamed: 29 Unnamed: 30 Unnamed: 31 Unnamed: 32 Unnamed: 33 Unnamed: 34 Unnamed: 35
0 United States 331515736.00000 331577720.00000 332099760.00000 334017321.00000 336806231.00000 340110988.00000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Northeast 57617706.00000 57431458.00000 57252533.00000 57159597.00000 57398303.00000 57832935.00000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Midwest 68998970.00000 68984258.00000 68872831.00000 68903297.00000 69186401.00000 69596584.00000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 South 126281537.00000 126476549.00000 127368010.00000 129037849.00000 130893358.00000 132665693.00000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 West 78617523.00000 78685455.00000 78606386.00000 78916578.00000 79328169.00000 80015776.00000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 .Alabama 5025369.00000 5033094.00000 5049196.00000 5076181.00000 5117673.00000 5157699.00000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 .Alaska 733395.00000 733017.00000 734420.00000 734442.00000 736510.00000 740133.00000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 .Arizona 7158110.00000 7187135.00000 7274078.00000 7377566.00000 7473027.00000 7582384.00000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 .Arkansas 3011553.00000 3014546.00000 3026870.00000 3047704.00000 3069463.00000 3088354.00000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 .California 39555674.00000 39521958.00000 39142565.00000 39142414.00000 39198693.00000 39431263.00000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Unnamed: 0 Unnamed: 1 2020 2021 2022 2023 2024 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Unnamed: 15 Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20 Unnamed: 21 Unnamed: 22 Unnamed: 23 Unnamed: 24 Unnamed: 25 Unnamed: 26 Unnamed: 27 Unnamed: 28 Unnamed: 29 Unnamed: 30 Unnamed: 31 Unnamed: 32 Unnamed: 33 Unnamed: 34 Unnamed: 35
59 Note: The estimates are developed from a base ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
60 Suggested Citation: NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
61 Annual Estimates of the Resident Population fo... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
62 Source: U.S. Census Bureau, Population Division NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
63 Release Date: December 2024 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [18]:
# census 2020s data
census_2020s_df = (
    census_2020s_df
    .drop(columns=["Unnamed: 1"])
    .rename(columns={"Unnamed: 0": "Geographic Area"})
    # ignore all the blank extra columns
    .loc[:, "Geographic Area":"2024"]
    .convert_dtypes()       
    .dropna()                  
)
census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
census_2020s_df
Out[18]:
Geographic Area 2020 2021 2022 2023 2024
0 United States 331577720 332099760 334017321 336806231 340110988
1 Northeast 57431458 57252533 57159597 57398303 57832935
2 Midwest 68984258 68872831 68903297 69186401 69596584
3 South 126476549 127368010 129037849 130893358 132665693
4 West 78685455 78606386 78916578 79328169 80015776
... ... ... ... ... ... ...
52 Washington 7727209 7743760 7794123 7857320 7958180
53 West Virginia 1791646 1785618 1774122 1770495 1769979
54 Wisconsin 5897375 5881608 5903975 5930405 5960975
55 Wyoming 577681 579636 581978 585067 587618
57 Puerto Rico 3281590 3262711 3220137 3203792 3203295

57 rows × 6 columns

With that, we're in business!

We now have U.S. Census data from 2019, 2020, and 2021 in a format that is compatible with our TB case count data.




👥 Joining TB case counts with census data¶

Time to merge our datasets (i.e., join them)!

In [19]:
# Show the three tables that we are going to join.
# To keep things simple, let's just look at the last two rows of each df.
display(tb_df.tail(2))
display(census_2010s_df.tail(2))
display(census_2020s_df.tail(2))
U.S. jurisdiction 2019 2020 2021
50 Wisconsin 51 35 66
51 Wyoming 1 0 3
Geographic Area 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
55 Wyoming 564487 567299 576305 582122 582531 585613 584215 578931 577601 578759
57 Puerto Rico 3721525 3678732 3634488 3593077 3534874 3473232 3406672 3325286 3193354 3193694
Geographic Area 2020 2021 2022 2023 2024
55 Wyoming 577681 579636 581978 585067 587618
57 Puerto Rico 3281590 3262711 3220137 3203792 3203295

We're only interested in the population for the years 2019, 2020, and 2021, so let's select just those columns:

In [20]:
# Select only the relevant population years
census_2019_df = census_2010s_df[['Geographic Area', '2019']]
census_2020_2021_df = census_2020s_df[['Geographic Area', '2020', '2021']]

display(tb_df.tail(2))
display(census_2019_df.tail(2))
display(census_2020_2021_df.tail(2))
U.S. jurisdiction 2019 2020 2021
50 Wisconsin 51 35 66
51 Wyoming 1 0 3
Geographic Area 2019
55 Wyoming 578759
57 Puerto Rico 3193694
Geographic Area 2020 2021
55 Wyoming 577681 579636
57 Puerto Rico 3281590 3262711

All three dataframes have a column containing U.S. states, along with some other geographic areas. These columns are our join keys.

  • Below, we use df1.merge(right=df2, ...) to carry out the merge (documentation).

  • We could have alternatively used the function pd.merge(left=df1, right=df2, ...) (documentation); a short sketch follows.
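
For example, the first merge below could equivalently be written with the top-level function:

# Sketch: the same merge, written with pd.merge instead of the DataFrame method.
pd.merge(left=tb_df, right=census_2019_df,
         left_on="U.S. jurisdiction", right_on="Geographic Area")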

In [21]:
# merge TB dataframe with two US census dataframes
tb_census_df = (
    tb_df
    .merge(right=census_2019_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area")
    .merge(right=census_2020_2021_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area")
)
tb_census_df.tail(2)
Out[21]:
U.S. jurisdiction 2019_x 2020_x 2021_x Geographic Area_x 2019_y Geographic Area_y 2020_y 2021_y
49 Wisconsin 51 35 66 Wisconsin 5822434 Wisconsin 5897375 5881608
50 Wyoming 1 0 3 Wyoming 578759 Wyoming 577681 579636

To see what's going on with the duplicated columns and the _x and _y suffixes, let's do just the first merge:

In [22]:
tb_df.merge(right=census_2019_df, 
            left_on="U.S. jurisdiction", 
            right_on="Geographic Area").head()
Out[22]:
U.S. jurisdiction 2019_x 2020 2021 Geographic Area 2019_y
0 Alabama 87 72 92 Alabama 4903185
1 Alaska 58 58 58 Alaska 731545
2 Arizona 183 136 129 Arizona 7278717
3 Arkansas 64 59 69 Arkansas 3017804
4 California 2111 1706 1750 California 39512223

Notice that the columns containing the join keys have all been retained, and all contain the same values.

  • Furthermore, notice that the duplicated columns are appended with _x and _y to keep the column names unique.

  • In the TB case count data, column 2019 represents the number of TB cases in 2019, but in the Census data, column 2019 represents the population estimate for 2019.

We can use the suffixes argument to modify the _x and _y defaults to our liking (documentation).

In [23]:
# Specify the suffixes to use for duplicated column names
tb_df.merge(right=census_2019_df,
           left_on="U.S. jurisdiction", 
           right_on="Geographic Area",
           suffixes=('_cases', '_population')).head()
Out[23]:
U.S. jurisdiction 2019_cases 2020 2021 Geographic Area 2019_population
0 Alabama 87 72 92 Alabama 4903185
1 Alaska 58 58 58 Alaska 731545
2 Arizona 183 136 129 Arizona 7278717
3 Arkansas 64 59 69 Arkansas 3017804
4 California 2111 1706 1750 California 39512223

Notice the _x and _y have changed to _cases and _population, just like we specified.

Putting it all together, and dropping the duplicated Geographic Area columns:

In [24]:
# Redux: merge TB dataframe with two US census dataframes
tb_census_df = (
    tb_df
    
    .merge(right=census_2019_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area",
           suffixes=('_cases', '_population'))
    .drop(columns="Geographic Area")

    .merge(right=census_2020_2021_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area",
           suffixes=('_cases', '_population'))
    .drop(columns="Geographic Area")
    
)
tb_census_df.tail(2)
Out[24]:
U.S. jurisdiction 2019_cases 2020_cases 2021_cases 2019_population 2020_population 2021_population
49 Wisconsin 51 35 66 5822434 5897375 5881608
50 Wyoming 1 0 3 578759 577681 579636

♻️ Reproduce incidence¶

Let's see if we can reproduce the original CDC numbers from our augmented dataset of TB case counts and state populations.

  • Recall that the nationwide TB incidence was 2.7 per 100,000 in 2019, 2.2 in 2020, and 2.4 in 2021.

  • Along the way, we'll also compute state-level incidence.

From the CDC report: TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”

Let's start with a simpler question: What is the per person incidence?

  • In other words, what is the probability that a randomly selected person in the population had TB within a given year?

$$\text{TB incidence per person} = \frac{\text{\# TB cases in population}}{\text{Total population size}}$$

Let's calculate per person incidence for 2019:

In [25]:
# Calculate per person incidence for 2019
tb_census_df["per person incidence 2019"] = (
    tb_census_df["2019_cases"]/tb_census_df["2019_population"]
)
tb_census_df
Out[25]:
U.S. jurisdiction 2019_cases 2020_cases 2021_cases 2019_population 2020_population 2021_population per person incidence 2019
0 Alabama 87 72 92 4903185 5033094 5049196 0.00002
1 Alaska 58 58 58 731545 733017 734420 0.00008
2 Arizona 183 136 129 7278717 7187135 7274078 0.00003
3 Arkansas 64 59 69 3017804 3014546 3026870 0.00002
4 California 2111 1706 1750 39512223 39521958 39142565 0.00005
... ... ... ... ... ... ... ... ...
46 Virginia 191 169 161 8535519 8637615 8658910 0.00002
47 Washington 221 163 199 7614893 7727209 7743760 0.00003
48 West Virginia 9 13 7 1792147 1791646 1785618 0.00001
49 Wisconsin 51 35 66 5822434 5897375 5881608 0.00001
50 Wyoming 1 0 3 578759 577681 579636 0.00000

51 rows × 8 columns

TB is really rare in the United States, so per person TB incidence is really low, as expected.

  • But, if we were to consider a group of 100,000 people, the expected number of TB cases among them would be a much more interpretable number.

  • In fact, it would be 100,000 times the per person incidence!

$$\text{TB incidence per 100,000} = 100{,}000 \times \text{TB incidence per person}$$

In [26]:
# To help read bigger numbers in Python, you can use _ to separate thousands,
# akin to using commas. 100_000 is the same as writing 100000, but more readable.
tb_census_df["per 100k incidence 2019"] = (
    100_000 * tb_census_df["per person incidence 2019"] 
)
tb_census_df
Out[26]:
U.S. jurisdiction 2019_cases 2020_cases 2021_cases 2019_population 2020_population 2021_population per person incidence 2019 per 100k incidence 2019
0 Alabama 87 72 92 4903185 5033094 5049196 0.00002 1.77436
1 Alaska 58 58 58 731545 733017 734420 0.00008 7.92843
2 Arizona 183 136 129 7278717 7187135 7274078 0.00003 2.51418
3 Arkansas 64 59 69 3017804 3014546 3026870 0.00002 2.12075
4 California 2111 1706 1750 39512223 39521958 39142565 0.00005 5.34265
... ... ... ... ... ... ... ... ... ...
46 Virginia 191 169 161 8535519 8637615 8658910 0.00002 2.23771
47 Washington 221 163 199 7614893 7727209 7743760 0.00003 2.90221
48 West Virginia 9 13 7 1792147 1791646 1785618 0.00001 0.50219
49 Wisconsin 51 35 66 5822434 5897375 5881608 0.00001 0.87592
50 Wyoming 1 0 3 578759 577681 579636 0.00000 0.17278

51 rows × 9 columns

Now we're seeing more human-readable values.

  • For example, there were about 5.3 tuberculosis cases for every 100,000 California residents in 2019.

To wrap up this exercise, let's calculate the nationwide incidence of TB in 2019.

In [27]:
# Recall that the CDC reported an incidence of 2.7 per 100,000 in 2019.
tot_tb_cases_50_states = tb_census_df["2019_cases"].sum()
tot_pop_50_states = tb_census_df["2019_population"].sum()
tb_per_100k_50_states = 100_000 * tot_tb_cases_50_states / tot_pop_50_states
tb_per_100k_50_states
Out[27]:
2.7114346007625656

We can use a for loop to compute the incidence for 2019, 2020, and 2021.

  • You'll notice that, after rounding, we get the same numbers reported by the CDC!
In [28]:
# f strings (f"...") are a handy way to pass in variables to strings.
for year in [2019, 2020, 2021]:
  tot_tb_cases_50_states = tb_census_df[f"{year}_cases"].sum()
  tot_pop_50_states = tb_census_df[f"{year}_population"].sum()
  tb_per_100k_50_states = 100_000 * tot_tb_cases_50_states / tot_pop_50_states
  print(tb_per_100k_50_states)
2.7114346007625656
2.163293721906285
2.366758711298075
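
Here's a sketch of the same loop with labeled output, rounded to one decimal place for easier comparison with the CDC figures:

# Sketch: same computation, with labeled output rounded to one decimal place.
for year in [2019, 2020, 2021]:
    cases = tb_census_df[f"{year}_cases"].sum()
    pop = tb_census_df[f"{year}_population"].sum()
    print(f"{year}: {100_000 * cases / pop:.1f} TB cases per 100,000 people")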




Instructor Note: Return to Slides!