Lecture 12 – Simple Linear Regression

by Suraj Rampure

Notebook credits:

  • Ani Adhikari
  • Data 8's textbook, chapter 15
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go

Correlation

First, let's come up with some examples of data to use in slides. (Normally this wouldn't be put in the notebook, but it might be of interest to you.)

Also, note that we use np.corrcoef here to compute the correlation coefficients, because we haven't yet defined $r$ manually.

In [2]:
# Just noise
np.random.seed(43)
plt.figure(figsize = (4, 4))
plt.xticks([])
plt.yticks([])
plt.xlim(-3, 3)
plt.ylim(-3, 3)
x1, y1 = np.random.randn(2, 100)
plt.scatter(x1, y1, alpha = 0.75);
# plt.savefig('images/s1.png')
print(np.corrcoef(x1, y1))
[[ 1.        -0.1206646]
 [-0.1206646  1.       ]]
In [3]:
# Strong linear
np.random.seed(43)
plt.figure(figsize = (4, 4))
plt.xticks([])
plt.yticks([])
plt.xlim(-3, 3)
plt.ylim(-3, 3)
x2 = np.linspace(-3, 3, 100)
y2 = x2*0.5 - 1 + np.random.randn(100)*0.3
plt.scatter(x2, y2, alpha = 0.75);
# plt.savefig('images/s2.png')
print(np.corrcoef(x2, y2))
[[1.         0.94882263]
 [0.94882263 1.        ]]
In [4]:
# Strong non-linear
np.random.seed(43)
plt.figure(figsize = (4, 4))
plt.xticks([])
plt.yticks([])
plt.xlim(-3, 3)
plt.ylim(-3, 3)
x3 = np.linspace(-3, 3, 100)
y3 = 2*np.sin(x3 - 1.5) + np.random.randn(100)*0.3
plt.scatter(x3, y3, alpha = 0.75);
# plt.savefig('images/s3.png')
print(np.corrcoef(x3, y3))
[[1.         0.05201679]
 [0.05201679 1.        ]]
In [5]:
# Unequal spread
np.random.seed(43)
plt.figure(figsize = (4, 4))
plt.xticks([])
plt.yticks([])
plt.xlim(-3, 3)
plt.ylim(-3, 3)
x4 = np.linspace(-3, 3, 100)
y4 = x4/3 + np.random.randn(100)*(x4)/2.5
plt.scatter(x4, y4, alpha = 0.75);
# plt.savefig('images/s4.png')
print(np.corrcoef(x4, y4))
[[1.         0.70406276]
 [0.70406276 1.        ]]

Simple Linear Regression

First, let's implement the tools we'll need for regression.
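
The correlation coefficient $r$ is the average of the products of $x$ and $y$, each measured in standard units:

$$r = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma_x}\right)\left(\frac{y_i - \bar{y}}{\sigma_y}\right)$$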

In [6]:
def standard_units(x):
    # Convert x to standard units: how many SDs above/below the mean each value is
    return (x - np.mean(x)) / np.std(x)

def correlation(x, y):
    # r is the average of the products of x and y, measured in standard units
    return np.mean(standard_units(x) * standard_units(y))
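
As a quick sanity check, our correlation function should agree with NumPy's np.corrcoef, since both compute the Pearson correlation. Using the synthetic data from earlier:

correlation(x2, y2), np.corrcoef(x2, y2)[0, 1]

Both values should be about 0.949, matching the output above.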

Let's read in our data.

In [7]:
df = pd.read_csv('galton.csv').iloc[:, 1:]  # drop the first (index) column
In [8]:
df
Out[8]:
     parent  child
0      70.5   61.7
1      68.5   61.7
2      65.5   61.7
3      64.5   61.7
4      64.0   61.7
...     ...    ...
923    69.5   73.7
924    69.5   73.7
925    69.5   73.7
926    69.5   73.7
927    69.5   73.7

928 rows × 2 columns

An interesting issue is that both the parent and child heights take on only a fixed set of values, so many points would be plotted directly on top of one another. We'll add some random noise (jitter) to both columns; otherwise we'd suffer from gross overplotting.

In [9]:
df['parent'] = df['parent'] + np.random.randn(len(df))/2
df['child'] = df['child'] + np.random.randn(len(df))/2
In [10]:
fig = px.scatter(df, x = 'parent', y = 'child')
fig.show()

Using our correlation function:

In [11]:
correlation(df['parent'], df['child'])
Out[11]:
0.44042226098613824

Using NumPy's built-in correlation function:

In [12]:
np.corrcoef(df['parent'], df['child'])
Out[12]:
array([[1.        , 0.44042226],
       [0.44042226, 1.        ]])
In [13]:
df.corr()
Out[13]:
parent child
parent 1.000000 0.440422
child 0.440422 1.000000

All three approaches give the same result.

What we now want to do is compute the average $y$ for a given $x$. A practical way to do this is to slide a 1-unit-wide window along the x-axis: for each $x$, we average the child heights of all rows whose parent height is within 0.5 of $x$. (We could choose a window of any width, though.)

In [14]:
def predict_mean_y(x):
    # Average child height over all rows whose parent height is within 0.5 of x
    return df.loc[np.abs(df['parent'] - x) <= 0.5, 'child'].mean()
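
For example, querying a single (hypothetical) midparent height:

predict_mean_y(68)

This returns the average child height among all rows whose parent height falls between 67.5 and 68.5.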
In [15]:
df['child_predicted'] = df['parent'].apply(predict_mean_y)

Now, let's look at our predictions:

In [16]:
fig = go.Figure()
fig.add_trace(go.Scatter(x = df['parent'], y = df['child'], mode = 'markers', name = 'actual'))
fig.add_trace(go.Scatter(x = df['parent'], y = df['child_predicted'], mode = 'markers', name = 'predicted means', line=dict(color='gold')))
fig.update_layout(xaxis_title = 'MidParent Height', yaxis_title = 'Child Height')

Save for the tails, where there are fewer values to draw from, our gold predicted means roughly follow a straight line through the "middle" of our point cloud. That's our motivation for using a line to model this bivariate data.

Note: The cool thing about plotly is that you can hover over each point and it will tell you whether it's a predicted mean or an actual value.

Now, it's time to implement the optimal coefficients.
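
For reference, minimizing mean squared error for a line $\hat{y} = a + bx$ gives the familiar closed-form solutions

$$\hat{b} = r \cdot \frac{\sigma_y}{\sigma_x}, \qquad \hat{a} = \bar{y} - \hat{b} \, \bar{x}$$

where $r$ is the correlation coefficient and $\sigma_x$, $\sigma_y$ are the standard deviations of $x$ and $y$.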

In [17]:
def slope(x, y):
    # Slope of the regression line: r * SD(y) / SD(x)
    return correlation(x, y) * np.std(y) / np.std(x)

def intercept(x, y):
    # The regression line passes through the point of averages (mean(x), mean(y))
    return np.mean(y) - slope(x, y)*np.mean(x)
In [18]:
ahat = intercept(df['parent'], df['child'])
bhat = slope(df['parent'], df['child'])

print("predicted y = {} + {} * average parent's height".format(np.round(ahat, 2), np.round(bhat, 2)))
predicted y = 26.51 + 0.61 * average parent's height
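
To make a prediction, we just plug a value into the fitted line. For example, for a (hypothetical) midparent height of 68 inches:

ahat + bhat * 68

which comes out to roughly 26.51 + 0.61 * 68 ≈ 68 inches.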

Let's see what our linear model looks like.

In [19]:
fig = go.Figure()
fig.add_trace(go.Scatter(x = df['parent'], y = df['child'], mode = 'markers', name = 'actual'))
fig.add_trace(go.Scatter(x = df['parent'], y = df['child_predicted'], mode = 'markers', name = 'predicted means', line=dict(color='gold')))
fig.add_trace(go.Scatter(x = df['parent'], y = ahat + bhat*df['parent'], name = 'linear model', line=dict(color='red')))


fig.update_layout(xaxis_title = 'MidParent Height', yaxis_title = 'Child Height')

Visualizing the Loss Surface

Let's look at the loss surface for the above model. Don't worry too much about what this code is doing.
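
Concretely, for a candidate intercept $a$ and slope $b$, the loss we plot is

$$\text{MSE}(a, b) = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - (a + b \cdot x_i)\big)^2$$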

In [20]:
def mse(y, yhat):
    return np.mean((y - yhat)**2)
In [21]:
# This function takes in our choice of [a, b] as a list, and returns
# the MSE for the corresponding linear model
def mse_for_height_model(t):
    a, b = t
    return mse(df['child'], a + b*df['parent'])
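
As a quick check, we can evaluate the loss at our fitted parameters; this should be the lowest point on the surface plotted below:

mse_for_height_model((ahat, bhat))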
In [22]:
num_points = 200 # increase for better resolution, but it will run more slowly. 


uvalues = np.linspace(20, 32, num_points)
vvalues = np.linspace(-1, 3, num_points)
(u,v) = np.meshgrid(uvalues, vvalues)
thetas = np.vstack((u.flatten(),v.flatten()))

MSE = np.array([mse_for_height_model(t) for t in thetas.T])

loss_surface = go.Surface(x=u, y=v, z=np.reshape(MSE, u.shape))

opt_point = go.Scatter3d(x = [ahat], y = [bhat], z = [mse_for_height_model((ahat, bhat))],
            mode = 'markers', name = 'optimal parameters',
            marker=dict(size=10, color='gold'))

fig = go.Figure(data=[loss_surface])
fig.add_trace(opt_point)

fig.update_layout(scene = dict(
    xaxis_title = "theta0",
    yaxis_title = "theta1",
    zaxis_title = "MSE"))


fig.show()