Lecture 10 Supplemental Notebook

Data 100, Summer 2021

Suraj Rampure, with updates by Fernando Pérez.

Scale

Let's now compute the relative change between the two years...

Current Population Survey

Now, let's compute the income gap as a relative quantity between men and women. Recall that the structure of the dataframe is as follows:

This calls for a groupby on Gender, which separates the data for the two genders so that we can then compute the ratio:

Let's now compute the alternate ratio, F/M instead:
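As a minimal sketch of both ratios, assume a toy dataframe with hypothetical columns `Gender` and `Income`. The values are made up for illustration and are not the CPS figures, and using the median as the summary statistic is also an assumption:

```python
import pandas as pd

# Toy stand-in for the CPS table; column names and values are hypothetical.
cps_toy = pd.DataFrame({
    "Gender": ["Men", "Men", "Women", "Women"],
    "Income": [1000.0, 1200.0, 800.0, 960.0],
})

# Median income per gender, indexed by the Gender label.
median_income = cps_toy.groupby("Gender")["Income"].median()

ratio_m_over_f = median_income["Men"] / median_income["Women"]
ratio_f_over_m = median_income["Women"] / median_income["Men"]

print(f"M/F: {ratio_m_over_f:.3f}, F/M: {ratio_f_over_m:.3f}")
```

With these toy numbers the two ratios are 1.25 and 0.80. They describe the same gap, yet one reads as "25% more" and the other as "20% less", which is why the choice of reference group matters.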

Overplotting

Kernel Density Estimates

Let's define some kernels. We will explain these formulas momentarily. We'll also define some helper functions for visualization purposes.

Here are our five points.

Step 1: Place a kernel at each point

We'll start with the Gaussian kernel.

Step 2: Normalize kernels so that total area is 1

Step 3: Sum all kernels together

This looks identical to the smooth curve that sns.distplot gives us (when we set the appropriate parameter):
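The three steps above can be sketched numerically as follows. The five points, the bandwidth, and the helper name `gaussian_kernel` are assumptions chosen for illustration, not the values used in the lecture:

```python
import numpy as np

def gaussian_kernel(x, x_i, alpha):
    """Gaussian density centered at x_i with standard deviation alpha."""
    return np.exp(-(x - x_i) ** 2 / (2 * alpha ** 2)) / np.sqrt(2 * np.pi * alpha ** 2)

# Hypothetical data points and bandwidth, chosen only for illustration.
points = np.array([2.2, 2.8, 3.7, 5.3, 5.7])
alpha = 1.0
grid = np.linspace(-5, 13, 2001)

# Step 1: one kernel per point (each row is a curve with area 1).
kernels = np.array([gaussian_kernel(grid, x_i, alpha) for x_i in points])

# Step 2: scale each kernel by 1/n so that the areas add up to 1.
scaled = kernels / len(points)

# Step 3: sum the scaled kernels pointwise to get the density estimate.
density = scaled.sum(axis=0)

# Riemann-sum check that the total area is (approximately) 1.
dx = grid[1] - grid[0]
total_area = density.sum() * dx
print(round(total_area, 4))
```

Plotting `density` against `grid` (for example with `plt.plot(grid, density)`) would draw the resulting smooth curve.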

Kernels

Gaussian

$$K_{\alpha}(x, x_i) = \frac{1}{\sqrt{2 \pi \alpha^2}} e^{-\frac{(x - x_i)^2}{2\alpha^2}}$$

Boxcar

$$K_{\alpha}(x, x_i) = \begin{cases} \frac{1}{\alpha}, & |x - x_i| \leq \frac{\alpha}{2} \\ 0, & \text{else} \end{cases}$$
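A minimal sketch of the boxcar formula in NumPy. The function name `boxcar_kernel`, the center, and the bandwidth are hypothetical choices for illustration:

```python
import numpy as np

def boxcar_kernel(x, x_i, alpha):
    """Height 1/alpha within alpha/2 of x_i, and 0 elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x - x_i) <= alpha / 2, 1.0 / alpha, 0.0)

# Hypothetical center and bandwidth for illustration.
values = boxcar_kernel([0.0, 2.5, 3.0, 4.0, 4.1], x_i=3.0, alpha=2.0)
print(values)
```

The box has width $\alpha$ and height $\frac{1}{\alpha}$, so its area is 1 for any choice of $\alpha$, just like the Gaussian kernel.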

Effect of bandwidth hyperparameter $\alpha$

Let's bring in some (different) toy data.
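One rough way to see the effect of $\alpha$ without a plot is to count the peaks of the estimated curve. The sketch below assumes hypothetical toy data and two arbitrary bandwidths; the helper names are also made up for illustration:

```python
import numpy as np

def gaussian_kde_curve(grid, data, alpha):
    """Average of Gaussian kernels with bandwidth alpha, evaluated on grid."""
    diffs = grid[:, None] - data[None, :]  # shape (len(grid), len(data))
    kernels = np.exp(-diffs ** 2 / (2 * alpha ** 2)) / np.sqrt(2 * np.pi * alpha ** 2)
    return kernels.mean(axis=1)

def count_modes(curve):
    """Number of strict interior local maxima of a sampled curve."""
    interior = curve[1:-1]
    return int(np.sum((interior > curve[:-2]) & (interior > curve[2:])))

# Hypothetical toy data, chosen only for illustration.
data = np.array([1.0, 2.0, 2.5, 4.0, 4.5])
grid = np.linspace(-1, 7, 801)

modes_small = count_modes(gaussian_kde_curve(grid, data, alpha=0.1))
modes_large = count_modes(gaussian_kde_curve(grid, data, alpha=4.0))
print(modes_small, modes_large)
```

A very small bandwidth leaves one spike per data point, while a very large one smooths everything into a single broad bump. Useful bandwidths sit between these extremes.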

KDE Formula

$$f_{\alpha}(x) = \sum_{i = 1}^n \frac{1}{n} \cdot K_{\alpha}(x, x_i) = \frac{1}{n} \sum_{i = 1}^n K_{\alpha}(x, x_i)$$
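The $\frac{1}{n}$ weight is what keeps the estimate a valid density. A short check, assuming only that each kernel integrates to 1 over $x$ (which holds for both the Gaussian and boxcar kernels above):

$$\int_{-\infty}^{\infty} f_{\alpha}(x) \, dx = \frac{1}{n} \sum_{i = 1}^n \int_{-\infty}^{\infty} K_{\alpha}(x, x_i) \, dx = \frac{1}{n} \cdot n = 1$$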

CO2 Emissions

Functional relations

Transformations

A synthetic example

Let's generate data that follows $y = 2x^3 + \epsilon$, where $\epsilon$ is zero-mean noise. Given the functional form of $y$, if we simply draw $\epsilon \sim \mathcal{N}(0,1)$, the noise will be insignificant for higher values of $x$ (in the range we'll look at, $[1, 10]$). So we will instead scale the noise by $x^2$, taking $\epsilon = x^2 z$ with $z \sim \mathcal{N}(0,1)$, so that the noise is visible for all values of $x$ and $y$.
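A minimal sketch of this data-generating process. The sample size, the seed, and the choice to draw $x$ uniformly are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility

n = 200  # hypothetical sample size
x = rng.uniform(1, 10, size=n)

# Noise whose standard deviation grows like x^2, so it stays visible as y grows.
eps = x ** 2 * rng.normal(0, 1, size=n)
y = 2 * x ** 3 + eps
```

A scatter plot of `x` against `y` would show the upward-curving cloud that the transformations below try to straighten.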

The bulge diagram says to raise $x$ to a power, or to take the log of $y$.

First, let's raise $x$ to a power:

We used $x^2$ as the transformation. It's better, but still not linear. Let's try $x^3$.

That worked well, which makes sense: the original data was cubic in $x$. We can overdo it, too: let's try $x^5$.

Now the data follows a concave, square-root-like relationship: since $y \approx 2x^3$, plotting against $u = x^5$ gives $y \approx 2u^{3/5}$. It's certainly not linear; this goes to show that not all power transformations work the same way, and you'll need some experimentation.
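As a rough numeric complement to the plots, one option is to compare the Pearson correlation between $y$ and each power of $x$. This is only an illustrative check under assumed data (sample size and seed are arbitrary), and correlation is a coarse measure of linearity:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
y = 2 * x ** 3 + x ** 2 * rng.normal(0, 1, size=200)

# Pearson correlation between x**p and y for each candidate power p.
correlations = {p: np.corrcoef(x ** p, y)[0, 1] for p in (1, 2, 3, 5)}
best_power = max(correlations, key=correlations.get)

for p, r in correlations.items():
    print(f"x^{p}: r = {r:.4f}")
print("most linear power:", best_power)
```

All four correlations come out high, because every power of $x$ is increasing in $x$. The differences between them are small, so the scatter plots remain the more informative diagnostic.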

Let's instead try taking the log of y from the original data.

On its own, this didn't quite work! Ignoring the noise, $y = 2x^3$, so $\log(y) = \log(2) + 3\log(x)$.

That means we are essentially plotting plt.scatter(x, np.log(x)), which is not linear.

In order for this to be linear, we need to take the log of $x$ as well:

The relationship being visualized now is

$$\log(y) = \log(2) + 3 \log(x)$$
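If the log-log relationship is linear, a straight-line fit should recover a slope near 3 and an intercept near $\log(2) \approx 0.69$. A minimal sketch under assumed data (sample size and seed are arbitrary), using `np.polyfit` as one possible way to fit the line:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=200)
y = 2 * x ** 3 + x ** 2 * rng.normal(0, 1, size=200)

# The noise can occasionally push y below zero for small x; log needs y > 0.
mask = y > 0
log_x = np.log(x[mask])
log_y = np.log(y[mask])

# Degree-1 fit: polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(log_x, log_y, 1)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The fitted values will not match 3 and $\log(2)$ exactly, since the additive noise is not symmetric after taking logs, but they should land close.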

Kepler's third law

Details and data can be found on Wikipedia.

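Kepler's third law states that the square of a planet's orbital period $T$ is proportional to the cube of the semi-major axis $R$ of its orbit: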

$$ T^2\propto R^3 $$
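Taking logs of both sides gives $\log(T) = \frac{3}{2}\log(R) + c$, so a log-log plot should be a line with slope 1.5. The sketch below checks this with rounded, commonly tabulated values for six planets (semi-major axis in AU, period in Earth years). These are approximations and not necessarily the exact table used in the lecture:

```python
import numpy as np

# Approximate semi-major axes (AU) and orbital periods (Earth years),
# in the order Mercury, Venus, Earth, Mars, Jupiter, Saturn.
R = np.array([0.387, 0.723, 1.000, 1.524, 5.203, 9.537])
T = np.array([0.241, 0.615, 1.000, 1.881, 11.862, 29.457])

# Fit a line to log(T) versus log(R); the slope estimates the exponent.
slope, intercept = np.polyfit(np.log(R), np.log(T), 1)

# In these units the proportionality constant is close to 1.
ratios = T ** 2 / R ** 3

print(f"slope = {slope:.4f}")
print(np.round(ratios, 3))
```

In units of AU and years the constant of proportionality is about 1, because Earth itself has $R = 1$ and $T = 1$.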

For Kepler this was a data-driven phenomenological law, formulated in 1619. It could only be explained dynamically once Newton introduced his law of universal gravitation in 1687.