Raguvir Kunani and Isaac Schmidt, Summer 2021
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Let's import our data and see what we have.
grades = pd.read_csv('grades.csv')
grades.head(5)
| | Lab Total | Discussion Total | Homework Total | Midterm | Final |
|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.987646 | 0.896607 | 0.631746 |
| 1 | 1.0 | 1.0 | 0.817488 | 0.868057 | 0.770000 |
| 2 | 1.0 | 1.0 | 0.899066 | 0.678242 | 0.505476 |
| 3 | 1.0 | 1.0 | 1.000000 | 0.888889 | 0.964444 |
| 4 | 1.0 | 1.0 | 0.976408 | 0.912037 | 0.722381 |
grades.shape
(787, 5)
The first thing we need to do is center our data. We could standardize as well, but as each column is on roughly the same scale, we will not do so here.
means = np.mean(grades, axis = 0)
means
Lab Total           0.988739
Discussion Total    0.965216
Homework Total      0.907690
Midterm             0.735748
Final               0.603616
dtype: float64
grades_centered = grades - means
grades_centered.head()
| | Lab Total | Discussion Total | Homework Total | Midterm | Final |
|---|---|---|---|---|---|
| 0 | 0.011261 | 0.034784 | 0.079956 | 0.160860 | 0.028130 |
| 1 | 0.011261 | 0.034784 | -0.090202 | 0.132310 | 0.166384 |
| 2 | 0.011261 | 0.034784 | -0.008624 | -0.057506 | -0.098140 |
| 3 | 0.011261 | 0.034784 | 0.092310 | 0.153141 | 0.360829 |
| 4 | 0.011261 | 0.034784 | 0.068718 | 0.176289 | 0.118765 |
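To make the difference between centering and standardizing concrete, here is a minimal sketch on made-up data. The `toy` DataFrame and its values are hypothetical stand-ins for `grades.csv`, not the real grades.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for grades.csv (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Midterm': rng.uniform(0.4, 1.0, size=50),
    'Final': rng.uniform(0.2, 1.0, size=50),
})

# Centering: subtract each column's mean, so every column has mean 0.
toy_centered = toy - toy.mean(axis=0)

# Standardizing: also divide by each column's standard deviation,
# so every column has mean 0 and standard deviation 1.
toy_standardized = toy_centered / toy.std(axis=0, ddof=0)

print(toy_centered.mean(axis=0))             # approximately 0 in each column
print(toy_standardized.std(axis=0, ddof=0))  # approximately 1 in each column
```

Centering only shifts the data, so the spread of each column is unchanged. Standardizing also rescales, which matters when columns are measured in very different units.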
X = grades_centered[['Midterm', 'Final']].to_numpy()
X
array([[ 0.16085973, 0.02813011],
[ 0.13230973, 0.16638408],
[-0.05750601, -0.09813973],
...,
[ 0.18709399, 0.25368567],
[ 0.05977917, 0.212019 ],
[ 0.05283473, 0.03813011]])
Let's plot our data.
plt.figure(figsize = (4.5, 4), dpi = 100)
ax = sns.scatterplot(x = X[:, 0], y = X[:, 1], s = 5, alpha = 1)
ax.set_aspect('equal')
ax.set_xlabel('Midterm Exam')
ax.set_ylabel('Final Exam')
ax.set_ylim(-.55, .55)
ax.set_xlim(-.55, .55);
Let's calculate the covariance matrix for these two columns. Because X is centered, $\frac{1}{n} X^T X$ is the covariance matrix, and it matches what np.cov returns with ddof = 0.
(X.T @ X) / len(X)
array([[0.02159435, 0.02052681],
[0.02052681, 0.03665306]])
cov = np.cov(X, rowvar = False, ddof = 0)
cov
array([[0.02159435, 0.02052681],
[0.02052681, 0.03665306]])
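The agreement above depends on two things: the data being centered, and `ddof = 0`. The sketch below checks both on synthetic data. The array `raw` is hypothetical and is not drawn from the grades.

```python
import numpy as np

# Hypothetical data: 100 rows, 2 columns, deliberately not centered.
rng = np.random.default_rng(0)
raw = rng.normal(loc=3.0, size=(100, 2))
n = len(raw)

col_means = raw.mean(axis=0)
Xc = raw - col_means  # centered copy

gram = (Xc.T @ Xc) / n
cov_pop = np.cov(Xc, rowvar=False, ddof=0)  # divides by n
cov_sample = np.cov(Xc, rowvar=False)       # default divides by n - 1

print(np.allclose(gram, cov_pop))                   # same matrix
print(np.allclose(gram * n / (n - 1), cov_sample))  # differs only by n / (n - 1)

# Without centering, X^T X / n is off by the outer product of the column means.
gram_raw = (raw.T @ raw) / n
print(np.allclose(gram_raw - np.outer(col_means, col_means), cov_pop))
```

With 787 rows, the factor $n / (n - 1)$ is very close to 1, so the choice of `ddof` changes the entries only slightly, but centering is essential.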
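Now, let's determine the eigenvalues and eigenvectors of this matrix. We'll use np.linalg.eigh, which is faster than np.linalg.eig for symmetric matrices (which covariance matrices always are). It returns the eigenvalues in ascending order, with the corresponding unit eigenvectors as the columns of the second array.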
eigenvalues, eigenvectors = np.linalg.eigh(cov)
eigenvalues
array([0.00725956, 0.05098785])
eigenvectors
array([[-0.81986889, 0.57255131],
[ 0.57255131, 0.81986889]])
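As a sanity check on this output, the sketch below rebuilds the covariance matrix from the values printed above and verifies the defining property $\Sigma v = \lambda v$ for each column. The name `explained` is introduced here for illustration only.

```python
import numpy as np

# Covariance matrix copied from the output above.
cov = np.array([[0.02159435, 0.02052681],
                [0.02052681, 0.03665306]])

eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Eigenvector i is column i, paired with eigenvalues[i].
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    print(np.allclose(cov @ v, eigenvalues[i] * v))

# The eigenvectors of a symmetric matrix are orthonormal.
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(2)))

# The eigenvalues sum to the total variance (the trace of cov), so each
# one divided by the sum is the share of variance along that direction.
explained = eigenvalues[::-1] / eigenvalues.sum()  # largest first
print(explained)
```

For these two exams, the larger eigenvalue accounts for roughly 87.5% of the total variance, so a single direction captures most of the spread in the scatter plot.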
Now, we can plot the eigenvectors, each scaled by its eigenvalue. Note that we've also scaled up both eigenvectors by the same constant, so they are more readable on the plot.
scale_eigenvalues = 7.5
plt.figure(figsize = (4.5, 4), dpi = 100)
ax = sns.scatterplot(x = X[:, 0], y = X[:, 1], s = 5, alpha = 1)
# np.linalg.eigh returns eigenvectors as columns: eigenvectors[:, i] pairs with eigenvalues[i]
ax.arrow(0, 0, scale_eigenvalues * eigenvalues[1] * eigenvectors[0, 1], scale_eigenvalues * eigenvalues[1] * eigenvectors[1, 1], head_width = .02, lw = 2, color = 'black')
ax.arrow(0, 0, scale_eigenvalues * eigenvalues[0] * eigenvectors[0, 0], scale_eigenvalues * eigenvalues[0] * eigenvectors[1, 0], head_width = .02, lw = 2, color = 'black')
ax.set_aspect('equal')
ax.set_xlabel('Midterm Exam')
ax.set_ylabel('Final Exam')
ax.set_ylim(-.55, .55)
ax.set_xlim(-.55, .55);
Finally, let's project the data onto a unit vector at every angle from 0 to 359 degrees and look at how the distribution of the projected scores changes.
import plotly.express as px
alphas = np.arange(0, 360)
dfs = []
for alpha in alphas:
    # Unit vector pointing at angle alpha (in degrees)
    proj_vec = np.array([np.cos(alpha * np.pi / 180), np.sin(alpha * np.pi / 180)])
    proj_vals = (X @ proj_vec) #/ np.linalg.norm(proj_vec) # dividing is redundant here
    dfs.append(pd.DataFrame(data={
        'Midterm': grades_centered['Midterm'],
        'Final': grades_centered['Final'],
        'Alpha': [alpha] * len(grades),
        'Score': proj_vals
    }))
scatter_df = pd.concat(dfs)
px.histogram(scatter_df,
x = 'Score',
animation_frame = 'Alpha',
histnorm = 'probability density',
range_x = [-0.75, 0.75],
range_y = [0, 5], nbins = 30,
title = 'Distribution of Scores for Different Linear Combinations')
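The animation can also be summarized numerically. For centered data, the variance of the scores projected onto a unit vector $u$ is $u^T \Sigma u$, so we can sweep the angle and see where the spread is largest. This is a minimal sketch that reuses the covariance matrix printed earlier. The names `directions`, `proj_var`, `best_angle`, and `top_angle` are introduced here for illustration.

```python
import numpy as np

# Covariance matrix copied from the output earlier in the notebook.
cov = np.array([[0.02159435, 0.02052681],
                [0.02052681, 0.03665306]])

# A direction and its opposite give the same variance, so 0..179 suffices.
angles = np.arange(0, 180)
radians = np.deg2rad(angles)
directions = np.column_stack([np.cos(radians), np.sin(radians)])  # one unit vector per row

# Variance of the projection onto each direction: u^T cov u, row by row.
proj_var = np.einsum('ij,jk,ik->i', directions, cov, directions)
best_angle = angles[np.argmax(proj_var)]

# Compare with the direction of the eigenvector for the largest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
top = eigenvectors[:, -1]
top_angle = np.rad2deg(np.arctan2(top[1], top[0])) % 180  # the sign of an eigenvector is arbitrary

print(best_angle, top_angle)
print(proj_var.max(), eigenvalues[-1])
```

The widest histogram in the animation should appear near `best_angle`, which lands within a degree of the top eigenvector's direction, and the variance there is approximately the largest eigenvalue.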