Lecture 4 – Data 100, Fall 2023¶


Acknowledgments Page

In [1]:
import numpy as np
import pandas as pd
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
pd.set_option('display.float_format', '{:.2f}'.format)

# Silence some spurious seaborn warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

Tuberculosis in the United States¶

What can we say about the presence of Tuberculosis in the United States?

Let's look at the data included in the original CDC article published in 2021.



CSV and Nice Field Names¶

Suppose Table 1 was saved as a CSV file located in data/cdc_tuberculosis.csv.

(Side note: For the purposes of this demo, I actually manually made this CSV by copy-pasting into Excel, then saved as .csv. The powers of data science 🥳)

We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:

  1. Using a text editor like emacs, vim, VSCode, etc.
  2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
  3. The Python file object
  4. pandas, using pd.read_csv()

1, 2. Let's start with the first two so we really solidify the idea of a CSV as rectangular data (i.e., tabular data) stored as comma-separated values.
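As a small aside, Python's built-in csv module already understands the quoting rules of this format. A minimal, self-contained sketch (using two raw lines copied from the file, fed through io.StringIO rather than opening the file itself):

```python
import csv
import io

# Two raw lines from the CSV, wrapped in StringIO so no file is needed.
raw = io.StringIO(
    ',No. of TB cases,,,TB incidence,,\n'
    'Total,"8,900","7,173","7,860",2.71,2.16,2.37\n'
)

# csv.reader splits on commas but treats a quoted field like "8,900"
# as a single value, not as two fields.
for fields in csv.reader(raw):
    print(fields)
```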



  3. Next, let's try using the Python file object. Let's check out the first four lines:
In [3]:
with open("data/cdc_tuberculosis.csv", "r") as f:
    for i, row in enumerate(f):
        print(row)
        if i >= 3: break
,No. of TB cases,,,TB incidence,,

U.S. jurisdiction,2019,2020,2021,2019,2020,2021

Total,"8,900","7,173","7,860",2.71,2.16,2.37

Alabama,87,72,92,1.77,1.43,1.83

Whoa, why are there blank lines interspersed between the lines of the CSV?

You may recall that all line breaks in text files are encoded as the special newline character \n. Python's print() prints each string (newline included) and then adds an additional newline of its own.

If you're curious, we can use the repr() function to return the raw string with all special characters:

In [4]:
with open("data/cdc_tuberculosis.csv", "r") as f:
    for i, row in enumerate(f):
        print(repr(row)) # print raw strings
        if i >= 3: break
',No. of TB cases,,,TB incidence,,\n'
'U.S. jurisdiction,2019,2020,2021,2019,2020,2021\n'
'Total,"8,900","7,173","7,860",2.71,2.16,2.37\n'
'Alabama,87,72,92,1.77,1.43,1.83\n'
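Alternatively, since each row already ends in \n, you can suppress print()'s own trailing newline by passing end="". A minimal sketch (using two hard-coded rows rather than re-opening the file):

```python
# Each row read from a file keeps its trailing '\n', so tell print()
# not to add a second newline of its own.
lines = ['Alabama,87,72,92,1.77,1.43,1.83\n',
         'Alaska,58,58,58,7.91,7.92,7.92\n']
for row in lines:
    print(row, end="")
```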



  4. Finally, let's see the tried-and-true Data 100 approach: pandas.
In [5]:
tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
tb_df
Out[5]:
Unnamed: 0 No. of TB cases Unnamed: 2 Unnamed: 3 TB incidence Unnamed: 5 Unnamed: 6
0 U.S. jurisdiction 2019 2020 2021 2019.00 2020.00 2021.00
1 Total 8,900 7,173 7,860 2.71 2.16 2.37
2 Alabama 87 72 92 1.77 1.43 1.83
3 Alaska 58 58 58 7.91 7.92 7.92
4 Arizona 183 136 129 2.51 1.89 1.77
... ... ... ... ... ... ... ...
48 Virginia 191 169 161 2.23 1.96 1.86
49 Washington 221 163 199 2.90 2.11 2.57
50 West Virginia 9 13 7 0.50 0.73 0.39
51 Wisconsin 51 35 66 0.88 0.59 1.12
52 Wyoming 1 0 3 0.17 0.00 0.52

53 rows × 7 columns

Wait, what's up with the "Unnamed" column names? And the first row, for that matter?

Congratulations -- you're ready to wrangle your data. Because of how things are stored, we'll need to clean the data a bit to name our columns better.

A reasonable first step is to identify the row with the right header. The pd.read_csv() function (documentation) has the convenient header parameter:

In [6]:
tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
tb_df
Out[6]:
U.S. jurisdiction 2019 2020 2021 2019.1 2020.1 2021.1
0 Total 8,900 7,173 7,860 2.71 2.16 2.37
1 Alabama 87 72 92 1.77 1.43 1.83
2 Alaska 58 58 58 7.91 7.92 7.92
3 Arizona 183 136 129 2.51 1.89 1.77
4 Arkansas 64 59 69 2.12 1.96 2.28
... ... ... ... ... ... ... ...
47 Virginia 191 169 161 2.23 1.96 1.86
48 Washington 221 163 199 2.90 2.11 2.57
49 West Virginia 9 13 7 0.50 0.73 0.39
50 Wisconsin 51 35 66 0.88 0.59 1.12
51 Wyoming 1 0 3 0.17 0.00 0.52

52 rows × 7 columns

Wait...but now we can't differentiate between the "No. of TB cases" and "TB incidence" year columns. pandas has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us as humans understand the data.

We can do this manually with df.rename() (documentation):

In [7]:
rename_dict = {'2019': 'TB cases 2019',
               '2020': 'TB cases 2020',
               '2021': 'TB cases 2021',
               '2019.1': 'TB incidence 2019',
               '2020.1': 'TB incidence 2020',
               '2021.1': 'TB incidence 2021'}
tb_df = tb_df.rename(columns=rename_dict)
tb_df
Out[7]:
U.S. jurisdiction TB cases 2019 TB cases 2020 TB cases 2021 TB incidence 2019 TB incidence 2020 TB incidence 2021
0 Total 8,900 7,173 7,860 2.71 2.16 2.37
1 Alabama 87 72 92 1.77 1.43 1.83
2 Alaska 58 58 58 7.91 7.92 7.92
3 Arizona 183 136 129 2.51 1.89 1.77
4 Arkansas 64 59 69 2.12 1.96 2.28
... ... ... ... ... ... ... ...
47 Virginia 191 169 161 2.23 1.96 1.86
48 Washington 221 163 199 2.90 2.11 2.57
49 West Virginia 9 13 7 0.50 0.73 0.39
50 Wisconsin 51 35 66 0.88 0.59 1.12
51 Wyoming 1 0 3 0.17 0.00 0.52

52 rows × 7 columns
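For the record, rename_dict above was typed out by hand; with many more columns you could build the same mapping programmatically. A sketch, assuming the same six-column layout (three year columns of case counts, then three ".1"-suffixed incidence columns):

```python
# Build the same mapping as rename_dict, but programmatically:
# plain year columns are case counts; pandas suffixes the duplicate
# year columns with ".1", and those hold incidences.
years = ["2019", "2020", "2021"]
rename_dict = {y: f"TB cases {y}" for y in years}
rename_dict.update({f"{y}.1": f"TB incidence {y}" for y in years})
print(rename_dict)
```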




Record Granularity¶

You might already be wondering: What's up with that first record?

Row 0 is what we call a rollup record, or summary record. It's often useful when displaying tables to humans. The granularity of record 0 (Totals) vs the rest of the records (States) is different.


Okay, EDA step two. How was the rollup record aggregated?

Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get 2x the total cases in each of our TB cases by year (why? Hint: the Total rollup record is itself one of the rows being summed).

In [8]:
tb_df.sum(axis=0)
Out[8]:
U.S. jurisdiction    TotalAlabamaAlaskaArizonaArkansasCaliforniaCol...
TB cases 2019        8,9008758183642,111666718245583029973261085237...
TB cases 2020        7,1737258136591,706525417194122219282169239376...
TB cases 2021        7,8609258129691,750585443194992281064255127494...
TB incidence 2019                                               109.94
TB incidence 2020                                                93.09
TB incidence 2021                                               102.94
dtype: object


Whoa, what's going on? Check out the column types:

In [9]:
tb_df.dtypes
Out[9]:
U.S. jurisdiction     object
TB cases 2019         object
TB cases 2020         object
TB cases 2021         object
TB incidence 2019    float64
TB incidence 2020    float64
TB incidence 2021    float64
dtype: object


Looks like those commas are causing the TB cases columns to be read with the object dtype, pandas' catch-all storage type (here holding Python strings), so summing concatenates strings instead of adding integers.
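To see the problem in isolation, and one manual fix (stripping the commas with .str.replace() before casting), here's a toy version using the first three Total values:

```python
import pandas as pd

# Comma-formatted counts get parsed as strings (object dtype).
cases = pd.Series(["8,900", "7,173", "7,860"])
print(cases.dtype)    # object: summing these would concatenate strings

# Manual fix: strip the commas, then cast to integers.
cases_int = cases.str.replace(",", "").astype(int)
print(cases_int.sum())    # 23933
```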


Fortunately, pd.read_csv() also has a thousands parameter (for what it's worth, I didn't know this beforehand--I googled it):

In [10]:
# improve readability: chaining method calls with outer parentheses/line breaks
tb_df = (
    pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
    .rename(columns=rename_dict)
)
tb_df
Out[10]:
U.S. jurisdiction TB cases 2019 TB cases 2020 TB cases 2021 TB incidence 2019 TB incidence 2020 TB incidence 2021
0 Total 8900 7173 7860 2.71 2.16 2.37
1 Alabama 87 72 92 1.77 1.43 1.83
2 Alaska 58 58 58 7.91 7.92 7.92
3 Arizona 183 136 129 2.51 1.89 1.77
4 Arkansas 64 59 69 2.12 1.96 2.28
... ... ... ... ... ... ... ...
47 Virginia 191 169 161 2.23 1.96 1.86
48 Washington 221 163 199 2.90 2.11 2.57
49 West Virginia 9 13 7 0.50 0.73 0.39
50 Wisconsin 51 35 66 0.88 0.59 1.12
51 Wyoming 1 0 3 0.17 0.00 0.52

52 rows × 7 columns

In [11]:
tb_df.sum()
Out[11]:
U.S. jurisdiction    TotalAlabamaAlaskaArizonaArkansasCaliforniaCol...
TB cases 2019                                                    17800
TB cases 2020                                                    14346
TB cases 2021                                                    15720
TB incidence 2019                                               109.94
TB incidence 2020                                                93.09
TB incidence 2021                                               102.94
dtype: object

The Total TB cases look right. Phew!

(We'll leave it to your own EDA to figure out how the TB incidence "Totals" were aggregated...you may want to check out the bonus section first, though.)



Let's just look at the records with state-level granularity:

In [12]:
state_tb_df = tb_df[1:]
state_tb_df
Out[12]:
U.S. jurisdiction TB cases 2019 TB cases 2020 TB cases 2021 TB incidence 2019 TB incidence 2020 TB incidence 2021
1 Alabama 87 72 92 1.77 1.43 1.83
2 Alaska 58 58 58 7.91 7.92 7.92
3 Arizona 183 136 129 2.51 1.89 1.77
4 Arkansas 64 59 69 2.12 1.96 2.28
5 California 2111 1706 1750 5.35 4.32 4.46
... ... ... ... ... ... ... ...
47 Virginia 191 169 161 2.23 1.96 1.86
48 Washington 221 163 199 2.90 2.11 2.57
49 West Virginia 9 13 7 0.50 0.73 0.39
50 Wisconsin 51 35 66 0.88 0.59 1.12
51 Wyoming 1 0 3 0.17 0.00 0.52

51 rows × 7 columns

What do each of these values represent? Why?

To the lecture!






Gather Census Data¶

U.S. Census population estimates source (2019), source (2020-2021).

Running the cells below cleans the data. We encourage you to closely explore the CSV and study these lines after lecture...

There are a few new methods here:

  • df.convert_dtypes() (documentation) conveniently converts each column to the best possible dtype (e.g., whole-number floats to nullable ints); it is out of scope for the class.
  • df.dropna() (documentation) will be explained in more detail next time.
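To see what these two methods do, here's a toy DataFrame (a hypothetical two-column example, not the census data):

```python
import numpy as np
import pandas as pd

# A float column that is "really" integers, plus some missing values.
df = pd.DataFrame({"pop": [100.0, 200.0, np.nan],
                   "name": ["a", "b", None]})

# convert_dtypes() picks the best (nullable) dtype per column, e.g.
# Int64 for whole-number floats; dropna() drops rows with any missing value.
cleaned = df.convert_dtypes().dropna()
print(cleaned.dtypes)
print(len(cleaned))    # 2
```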
In [13]:
# 2010s census data
census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
census_2010s_df = (
    census_2010s_df
    .reset_index()
    .drop(columns=["index", "Census", "Estimates Base"])
    .rename(columns={"Unnamed: 0": "Geographic Area"})
    .convert_dtypes()                 # "smart" converting of columns, use at your own risk
    .dropna()                         # we'll introduce this next time
)
census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')

# with pd.option_context('display.min_rows', 30): # shows more rows
#     display(census_2010s_df)
    
census_2010s_df
Out[13]:
Geographic Area 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 United States 309321666 311556874 313830990 315993715 318301008 320635163 322941311 324985539 326687501 328239523
1 Northeast 55380134 55604223 55775216 55901806 56006011 56034684 56042330 56059240 56046620 55982803
2 Midwest 66974416 67157800 67336743 67560379 67745167 67860583 67987540 68126781 68236628 68329004
3 South 114866680 116006522 117241208 118364400 119624037 120997341 122351760 123542189 124569433 125580448
4 West 72100436 72788329 73477823 74167130 74925793 75742555 76559681 77257329 77834820 78347268
... ... ... ... ... ... ... ... ... ... ... ...
52 Washington 6742830 6826627 6897058 6963985 7054655 7163657 7294771 7423362 7523869 7614893
53 West Virginia 1854239 1856301 1856872 1853914 1849489 1842050 1831023 1817004 1804291 1792147
54 Wisconsin 5690475 5705288 5719960 5736754 5751525 5760940 5772628 5790186 5807406 5822434
55 Wyoming 564487 567299 576305 582122 582531 585613 584215 578931 577601 578759
57 Puerto Rico 3721525 3678732 3634488 3593077 3534874 3473232 3406672 3325286 3193354 3193694

57 rows × 11 columns

Occasionally, you will want to modify code that you have imported. To re-import those modifications, you can either use Python's importlib module:

from importlib import reload
reload(utils)

or use IPython magic, which will intelligently re-import code when files change:

%load_ext autoreload
%autoreload 2
In [14]:
# census 2020s data
census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
census_2020s_df = (
    census_2020s_df
    .reset_index()
    .drop(columns=["index", "Unnamed: 1"])
    .rename(columns={"Unnamed: 0": "Geographic Area"})
    .convert_dtypes()                 # "smart" converting of columns, use at your own risk
    .dropna()                         # we'll introduce this next time
)
census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')

census_2020s_df
Out[14]:
Geographic Area 2020 2021 2022
0 United States 331511512 332031554 333287557
1 Northeast 57448898 57259257 57040406
2 Midwest 68961043 68836505 68787595
3 South 126450613 127346029 128716192
4 West 78650958 78589763 78743364
... ... ... ... ...
52 Washington 7724031 7740745 7785786
53 West Virginia 1791420 1785526 1775156
54 Wisconsin 5896271 5880101 5892539
55 Wyoming 577605 579483 581381
57 Puerto Rico 3281557 3262693 3221789

57 rows × 4 columns




Join Data (Merge DataFrames)¶

Time to merge! Here I use the DataFrame method df1.merge(right=df2, ...) on DataFrame df1 (documentation). Contrast this with the function pd.merge(left=df1, right=df2, ...) (documentation). Feel free to use either.
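To confirm the two spellings are interchangeable, here's a toy check (hypothetical two-row frames, not the real data):

```python
import pandas as pd

left = pd.DataFrame({"state": ["Alabama", "Alaska"], "cases": [87, 58]})
right = pd.DataFrame({"state": ["Alabama", "Alaska"], "pop": [4903185, 731545]})

# Method form vs. function form: same result either way.
m1 = left.merge(right=right, on="state")
m2 = pd.merge(left=left, right=right, on="state")
print(m1.equals(m2))    # True
```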

In [15]:
# merge TB dataframe with two US census dataframes
tb_census_df = (
    tb_df
    .merge(right=census_2010s_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area")
    .merge(right=census_2020s_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area")
)
tb_census_df
Out[15]:
U.S. jurisdiction TB cases 2019 TB cases 2020 TB cases 2021 TB incidence 2019 TB incidence 2020 TB incidence 2021 Geographic Area_x 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 Geographic Area_y 2020 2021 2022
0 Alabama 87 72 92 1.77 1.43 1.83 Alabama 4785437 4799069 4815588 4830081 4841799 4852347 4863525 4874486 4887681 4903185 Alabama 5031362 5049846 5074296
1 Alaska 58 58 58 7.91 7.92 7.92 Alaska 713910 722128 730443 737068 736283 737498 741456 739700 735139 731545 Alaska 732923 734182 733583
2 Arizona 183 136 129 2.51 1.89 1.77 Arizona 6407172 6472643 6554978 6632764 6730413 6829676 6941072 7044008 7158024 7278717 Arizona 7179943 7264877 7359197
3 Arkansas 64 59 69 2.12 1.96 2.28 Arkansas 2921964 2940667 2952164 2959400 2967392 2978048 2989918 3001345 3009733 3017804 Arkansas 3014195 3028122 3045637
4 California 2111 1706 1750 5.35 4.32 4.46 California 37319502 37638369 37948800 38260787 38596972 38918045 39167117 39358497 39461588 39512223 California 39501653 39142991 39029342
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
46 Virginia 191 169 161 2.23 1.96 1.86 Virginia 8023699 8101155 8185080 8252427 8310993 8361808 8410106 8463587 8501286 8535519 Virginia 8636471 8657365 8683619
47 Washington 221 163 199 2.90 2.11 2.57 Washington 6742830 6826627 6897058 6963985 7054655 7163657 7294771 7423362 7523869 7614893 Washington 7724031 7740745 7785786
48 West Virginia 9 13 7 0.50 0.73 0.39 West Virginia 1854239 1856301 1856872 1853914 1849489 1842050 1831023 1817004 1804291 1792147 West Virginia 1791420 1785526 1775156
49 Wisconsin 51 35 66 0.88 0.59 1.12 Wisconsin 5690475 5705288 5719960 5736754 5751525 5760940 5772628 5790186 5807406 5822434 Wisconsin 5896271 5880101 5892539
50 Wyoming 1 0 3 0.17 0.00 0.52 Wyoming 564487 567299 576305 582122 582531 585613 584215 578931 577601 578759 Wyoming 577605 579483 581381

51 rows × 22 columns

This is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census DataFrames. Let's do the latter.

In [16]:
# try merging again, but cleaner this time
tb_census_df = (
    tb_df
    .merge(right=census_2010s_df[["Geographic Area", "2019"]],
           left_on="U.S. jurisdiction", right_on="Geographic Area")
    .drop(columns="Geographic Area")
    .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
           left_on="U.S. jurisdiction", right_on="Geographic Area")
    .drop(columns="Geographic Area")
)
tb_census_df
Out[16]:
U.S. jurisdiction TB cases 2019 TB cases 2020 TB cases 2021 TB incidence 2019 TB incidence 2020 TB incidence 2021 2019 2020 2021
0 Alabama 87 72 92 1.77 1.43 1.83 4903185 5031362 5049846
1 Alaska 58 58 58 7.91 7.92 7.92 731545 732923 734182
2 Arizona 183 136 129 2.51 1.89 1.77 7278717 7179943 7264877
3 Arkansas 64 59 69 2.12 1.96 2.28 3017804 3014195 3028122
4 California 2111 1706 1750 5.35 4.32 4.46 39512223 39501653 39142991
... ... ... ... ... ... ... ... ... ... ...
46 Virginia 191 169 161 2.23 1.96 1.86 8535519 8636471 8657365
47 Washington 221 163 199 2.90 2.11 2.57 7614893 7724031 7740745
48 West Virginia 9 13 7 0.50 0.73 0.39 1792147 1791420 1785526
49 Wisconsin 51 35 66 0.88 0.59 1.12 5822434 5896271 5880101
50 Wyoming 1 0 3 0.17 0.00 0.52 578759 577605 579483

51 rows × 10 columns

Reproduce incidence¶

Let's recompute incidence to make sure we know where the original CDC numbers came from.

From the CDC report: TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”

If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as

$$\text{TB incidence} = \frac{\text{\# TB cases in population}}{\text{\# groups in population}} = \frac{\text{\# TB cases in population}}{\text{population}/100000} $$

$$= \frac{\text{\# TB cases in population}}{\text{population}} \times 100000$$

Let's try this for 2019:

In [17]:
tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
tb_census_df
Out[17]:
U.S. jurisdiction TB cases 2019 TB cases 2020 TB cases 2021 TB incidence 2019 TB incidence 2020 TB incidence 2021 2019 2020 2021 recompute incidence 2019
0 Alabama 87 72 92 1.77 1.43 1.83 4903185 5031362 5049846 1.77
1 Alaska 58 58 58 7.91 7.92 7.92 731545 732923 734182 7.93
2 Arizona 183 136 129 2.51 1.89 1.77 7278717 7179943 7264877 2.51
3 Arkansas 64 59 69 2.12 1.96 2.28 3017804 3014195 3028122 2.12
4 California 2111 1706 1750 5.35 4.32 4.46 39512223 39501653 39142991 5.34
... ... ... ... ... ... ... ... ... ... ... ...
46 Virginia 191 169 161 2.23 1.96 1.86 8535519 8636471 8657365 2.24
47 Washington 221 163 199 2.90 2.11 2.57 7614893 7724031 7740745 2.90
48 West Virginia 9 13 7 0.50 0.73 0.39 1792147 1791420 1785526 0.50
49 Wisconsin 51 35 66 0.88 0.59 1.12 5822434 5896271 5880101 0.88
50 Wyoming 1 0 3 0.17 0.00 0.52 578759 577605 579483 0.17

51 rows × 11 columns

Awesome!!!

Let's use a for-loop and Python f-strings to compute TB incidence for all years. We use f-strings just for the purposes of this demo, but they're handy to know when you explore data beyond this course (Python documentation).

In [18]:
# recompute incidence for all years
for year in [2019, 2020, 2021]:
    tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
tb_census_df
Out[18]:
U.S. jurisdiction TB cases 2019 TB cases 2020 TB cases 2021 TB incidence 2019 TB incidence 2020 TB incidence 2021 2019 2020 2021 recompute incidence 2019 recompute incidence 2020 recompute incidence 2021
0 Alabama 87 72 92 1.77 1.43 1.83 4903185 5031362 5049846 1.77 1.43 1.82
1 Alaska 58 58 58 7.91 7.92 7.92 731545 732923 734182 7.93 7.91 7.90
2 Arizona 183 136 129 2.51 1.89 1.77 7278717 7179943 7264877 2.51 1.89 1.78
3 Arkansas 64 59 69 2.12 1.96 2.28 3017804 3014195 3028122 2.12 1.96 2.28
4 California 2111 1706 1750 5.35 4.32 4.46 39512223 39501653 39142991 5.34 4.32 4.47
... ... ... ... ... ... ... ... ... ... ... ... ... ...
46 Virginia 191 169 161 2.23 1.96 1.86 8535519 8636471 8657365 2.24 1.96 1.86
47 Washington 221 163 199 2.90 2.11 2.57 7614893 7724031 7740745 2.90 2.11 2.57
48 West Virginia 9 13 7 0.50 0.73 0.39 1792147 1791420 1785526 0.50 0.73 0.39
49 Wisconsin 51 35 66 0.88 0.59 1.12 5822434 5896271 5880101 0.88 0.59 1.12
50 Wyoming 1 0 3 0.17 0.00 0.52 578759 577605 579483 0.17 0.00 0.52

51 rows × 13 columns

These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy. We'll leave it to you!

In [19]:
tb_census_df.describe()
Out[19]:
TB cases 2019 TB cases 2020 TB cases 2021 TB incidence 2019 TB incidence 2020 TB incidence 2021 2019 2020 2021 recompute incidence 2019 recompute incidence 2020 recompute incidence 2021
count 51.00 51.00 51.00 51.00 51.00 51.00 51.00 51.00 51.00 51.00 51.00 51.00
mean 174.51 140.65 154.12 2.10 1.78 1.97 6436069.08 6500225.73 6510422.63 2.10 1.78 1.97
std 341.74 271.06 286.78 1.50 1.34 1.48 7360660.47 7408168.46 7394300.08 1.50 1.34 1.47
min 1.00 0.00 2.00 0.17 0.00 0.21 578759.00 577605.00 579483.00 0.17 0.00 0.21
25% 25.50 29.00 23.00 1.29 1.21 1.23 1789606.00 1820311.00 1844920.00 1.30 1.21 1.23
50% 70.00 67.00 69.00 1.80 1.52 1.70 4467673.00 4507445.00 4506589.00 1.81 1.52 1.69
75% 180.50 139.00 150.00 2.58 1.99 2.22 7446805.00 7451987.00 7502811.00 2.58 1.99 2.22
max 2111.00 1706.00 1750.00 7.91 7.92 7.92 39512223.00 39501653.00 39142991.00 7.93 7.91 7.90





Bonus EDA¶

We likely won't get to this part, so a tutorial is provided for your own curiosity.

How do we reproduce that reported statistic in the original CDC report?

Reported TB incidence (cases per 100,000 persons) increased 9.4%, from 2.2 during 2020 to 2.4 during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.

This is TB incidence computed across the entire U.S. population! How do we reproduce this?

  • We need to reproduce the "Total" TB incidences in our rolled record.
  • But our current tb_census_df only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
  • What happened...?

Let's get exploring!

Before we keep exploring, I'm going to set each DataFrame's index to more meaningful values, instead of the default positional integers that pertained to some row at some point. This will make our cleaning slightly easier.

In [20]:
tb_df = tb_df.set_index("U.S. jurisdiction")
tb_df
Out[20]:
TB cases 2019 TB cases 2020 TB cases 2021 TB incidence 2019 TB incidence 2020 TB incidence 2021
U.S. jurisdiction
Total 8900 7173 7860 2.71 2.16 2.37
Alabama 87 72 92 1.77 1.43 1.83
Alaska 58 58 58 7.91 7.92 7.92
Arizona 183 136 129 2.51 1.89 1.77
Arkansas 64 59 69 2.12 1.96 2.28
... ... ... ... ... ... ...
Virginia 191 169 161 2.23 1.96 1.86
Washington 221 163 199 2.90 2.11 2.57
West Virginia 9 13 7 0.50 0.73 0.39
Wisconsin 51 35 66 0.88 0.59 1.12
Wyoming 1 0 3 0.17 0.00 0.52

52 rows × 6 columns

In [21]:
census_2010s_df = census_2010s_df.set_index("Geographic Area")
census_2010s_df
Out[21]:
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
Geographic Area
United States 309321666 311556874 313830990 315993715 318301008 320635163 322941311 324985539 326687501 328239523
Northeast 55380134 55604223 55775216 55901806 56006011 56034684 56042330 56059240 56046620 55982803
Midwest 66974416 67157800 67336743 67560379 67745167 67860583 67987540 68126781 68236628 68329004
South 114866680 116006522 117241208 118364400 119624037 120997341 122351760 123542189 124569433 125580448
West 72100436 72788329 73477823 74167130 74925793 75742555 76559681 77257329 77834820 78347268
... ... ... ... ... ... ... ... ... ... ...
Washington 6742830 6826627 6897058 6963985 7054655 7163657 7294771 7423362 7523869 7614893
West Virginia 1854239 1856301 1856872 1853914 1849489 1842050 1831023 1817004 1804291 1792147
Wisconsin 5690475 5705288 5719960 5736754 5751525 5760940 5772628 5790186 5807406 5822434
Wyoming 564487 567299 576305 582122 582531 585613 584215 578931 577601 578759
Puerto Rico 3721525 3678732 3634488 3593077 3534874 3473232 3406672 3325286 3193354 3193694

57 rows × 10 columns

In [22]:
census_2020s_df = census_2020s_df.set_index("Geographic Area")
census_2020s_df
Out[22]:
2020 2021 2022
Geographic Area
United States 331511512 332031554 333287557
Northeast 57448898 57259257 57040406
Midwest 68961043 68836505 68787595
South 126450613 127346029 128716192
West 78650958 78589763 78743364
... ... ... ...
Washington 7724031 7740745 7785786
West Virginia 1791420 1785526 1775156
Wisconsin 5896271 5880101 5892539
Wyoming 577605 579483 581381
Puerto Rico 3281557 3262693 3221789

57 rows × 3 columns

It turns out that our merge above only kept state records, even though our original tb_df had the "Total" rolled record:

In [23]:
tb_df.head()
Out[23]:
TB cases 2019 TB cases 2020 TB cases 2021 TB incidence 2019 TB incidence 2020 TB incidence 2021
U.S. jurisdiction
Total 8900 7173 7860 2.71 2.16 2.37
Alabama 87 72 92 1.77 1.43 1.83
Alaska 58 58 58 7.91 7.92 7.92
Arizona 183 136 129 2.51 1.89 1.77
Arkansas 64 59 69 2.12 1.96 2.28

Recall that merge performs an inner merge by default, meaning that it only preserves keys present in both DataFrames.
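A toy illustration of why the "Total" record vanished (hypothetical miniature frames mimicking our mismatched keys):

```python
import pandas as pd

tb = pd.DataFrame({"area": ["Total", "Alabama"], "cases": [8900, 87]})
census = pd.DataFrame({"area": ["United States", "Alabama"],
                       "pop": [331511512, 5031362]})

# Inner merge (the default): only keys present in BOTH frames survive,
# so the mismatched "Total" / "United States" rows both drop out.
inner = tb.merge(census, on="area")
print(inner["area"].tolist())    # ['Alabama']

# how="outer" would instead keep both unmatched rows, padded with NaNs.
outer = tb.merge(census, on="area", how="outer")
print(sorted(outer["area"]))    # ['Alabama', 'Total', 'United States']
```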

The rolled records in our census dataframes have different Geographic Area fields, which was the key we merged on:

In [24]:
census_2010s_df
Out[24]:
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
Geographic Area
United States 309321666 311556874 313830990 315993715 318301008 320635163 322941311 324985539 326687501 328239523
Northeast 55380134 55604223 55775216 55901806 56006011 56034684 56042330 56059240 56046620 55982803
Midwest 66974416 67157800 67336743 67560379 67745167 67860583 67987540 68126781 68236628 68329004
South 114866680 116006522 117241208 118364400 119624037 120997341 122351760 123542189 124569433 125580448
West 72100436 72788329 73477823 74167130 74925793 75742555 76559681 77257329 77834820 78347268
... ... ... ... ... ... ... ... ... ... ...
Washington 6742830 6826627 6897058 6963985 7054655 7163657 7294771 7423362 7523869 7614893
West Virginia 1854239 1856301 1856872 1853914 1849489 1842050 1831023 1817004 1804291 1792147
Wisconsin 5690475 5705288 5719960 5736754 5751525 5760940 5772628 5790186 5807406 5822434
Wyoming 564487 567299 576305 582122 582531 585613 584215 578931 577601 578759
Puerto Rico 3721525 3678732 3634488 3593077 3534874 3473232 3406672 3325286 3193354 3193694

57 rows × 10 columns

The Census DataFrame has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".

One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use df.rename() (documentation):

In [25]:
# rename rolled record for 2010s
census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
census_2010s_df
Out[25]:
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
Geographic Area
Total 309321666 311556874 313830990 315993715 318301008 320635163 322941311 324985539 326687501 328239523
Northeast 55380134 55604223 55775216 55901806 56006011 56034684 56042330 56059240 56046620 55982803
Midwest 66974416 67157800 67336743 67560379 67745167 67860583 67987540 68126781 68236628 68329004
South 114866680 116006522 117241208 118364400 119624037 120997341 122351760 123542189 124569433 125580448
West 72100436 72788329 73477823 74167130 74925793 75742555 76559681 77257329 77834820 78347268
... ... ... ... ... ... ... ... ... ... ...
Washington 6742830 6826627 6897058 6963985 7054655 7163657 7294771 7423362 7523869 7614893
West Virginia 1854239 1856301 1856872 1853914 1849489 1842050 1831023 1817004 1804291 1792147
Wisconsin 5690475 5705288 5719960 5736754 5751525 5760940 5772628 5790186 5807406 5822434
Wyoming 564487 567299 576305 582122 582531 585613 584215 578931 577601 578759
Puerto Rico 3721525 3678732 3634488 3593077 3534874 3473232 3406672 3325286 3193354 3193694

57 rows × 10 columns

In [26]:
# same, but for 2020s rename rolled record
census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
census_2020s_df
Out[26]:
2020 2021 2022
Geographic Area
Total 331511512 332031554 333287557
Northeast 57448898 57259257 57040406
Midwest 68961043 68836505 68787595
South 126450613 127346029 128716192
West 78650958 78589763 78743364
... ... ... ...
Washington 7724031 7740745 7785786
West Virginia 1791420 1785526 1775156
Wisconsin 5896271 5880101 5892539
Wyoming 577605 579483 581381
Puerto Rico 3281557 3262693 3221789

57 rows × 3 columns


Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (df.merge() documentation).

In [27]:
tb_census_df = (
    tb_df
    .merge(right=census_2010s_df[["2019"]],
           left_index=True, right_index=True)
    .merge(right=census_2020s_df[["2020", "2021"]],
           left_index=True, right_index=True)
)
tb_census_df
Out[27]:
TB cases 2019 TB cases 2020 TB cases 2021 TB incidence 2019 TB incidence 2020 TB incidence 2021 2019 2020 2021
Total 8900 7173 7860 2.71 2.16 2.37 328239523 331511512 332031554
Alabama 87 72 92 1.77 1.43 1.83 4903185 5031362 5049846
Alaska 58 58 58 7.91 7.92 7.92 731545 732923 734182
Arizona 183 136 129 2.51 1.89 1.77 7278717 7179943 7264877
Arkansas 64 59 69 2.12 1.96 2.28 3017804 3014195 3028122
... ... ... ... ... ... ... ... ... ...
Virginia 191 169 161 2.23 1.96 1.86 8535519 8636471 8657365
Washington 221 163 199 2.90 2.11 2.57 7614893 7724031 7740745
West Virginia 9 13 7 0.50 0.73 0.39 1792147 1791420 1785526
Wisconsin 51 35 66 0.88 0.59 1.12 5822434 5896271 5880101
Wyoming 1 0 3 0.17 0.00 0.52 578759 577605 579483

52 rows × 9 columns


Finally, let's recompute our incidences:

In [28]:
# recompute incidence for all years
for year in [2019, 2020, 2021]:
    tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
tb_census_df
Out[28]:
TB cases 2019 TB cases 2020 TB cases 2021 TB incidence 2019 TB incidence 2020 TB incidence 2021 2019 2020 2021 recompute incidence 2019 recompute incidence 2020 recompute incidence 2021
Total 8900 7173 7860 2.71 2.16 2.37 328239523 331511512 332031554 2.71 2.16 2.37
Alabama 87 72 92 1.77 1.43 1.83 4903185 5031362 5049846 1.77 1.43 1.82
Alaska 58 58 58 7.91 7.92 7.92 731545 732923 734182 7.93 7.91 7.90
Arizona 183 136 129 2.51 1.89 1.77 7278717 7179943 7264877 2.51 1.89 1.78
Arkansas 64 59 69 2.12 1.96 2.28 3017804 3014195 3028122 2.12 1.96 2.28
... ... ... ... ... ... ... ... ... ... ... ... ...
Virginia 191 169 161 2.23 1.96 1.86 8535519 8636471 8657365 2.24 1.96 1.86
Washington 221 163 199 2.90 2.11 2.57 7614893 7724031 7740745 2.90 2.11 2.57
West Virginia 9 13 7 0.50 0.73 0.39 1792147 1791420 1785526 0.50 0.73 0.39
Wisconsin 51 35 66 0.88 0.59 1.12 5822434 5896271 5880101 0.88 0.59 1.12
Wyoming 1 0 3 0.17 0.00 0.52 578759 577605 579483 0.17 0.00 0.52

52 rows × 12 columns

We reproduced the total U.S. incidences correctly!

We're almost there. Let's revisit the quote:

Reported TB incidence (cases per 100,000 persons) increased 9.4%, from 2.2 during 2020 to 2.4 during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.

Recall that percent change from $A$ to $B$ is computed as $$\text{percent change} = \frac{B - A}{A} \times 100$$.

In [29]:
incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
incidence_2020
Out[29]:
2.1637257652759883
In [30]:
incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
incidence_2021
Out[30]:
2.3672448914298068
In [31]:
difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
difference
Out[31]:
9.405957511804143

We did it!!!