import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates["plotly"].layout.colorway = px.colors.qualitative.Vivid
px.defaults.width = 800
from scipy.optimize import minimize
import sklearn.linear_model as lm
from sklearn.metrics import r2_score
In this lecture, we will work with the basketball dataset, which contains information about basketball games played in the NBA. In the cell below, we perform data cleaning to transform the data into a useful form, which we store as the DataFrame games.
Our goal in this portion of the lecture is to predict whether or not a team wins its game ("WON") given its "GOAL_DIFF". The variable "GOAL_DIFF" represents the difference in field goal percentage (the fraction of field goal attempts that were successful) between the two teams competing in a game. A positive value for "GOAL_DIFF" means that a team made a higher percentage of its field goal attempts than its opponent; a negative value means that the opponent had the higher field goal percentage.
basketball = pd.read_csv("data/nba.csv")
basketball.head()
| | SEASON_ID | TEAM_ID | TEAM_ABBREVIATION | TEAM_NAME | GAME_ID | GAME_DATE | MATCHUP | WL | MIN | FGM | ... | DREB | REB | AST | STL | BLK | TOV | PF | PTS | PLUS_MINUS | VIDEO_AVAILABLE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22017 | 1610612744 | GSW | Golden State Warriors | 21700002 | 2017-10-17 | GSW vs. HOU | L | 240 | 43 | ... | 35 | 41 | 34 | 5 | 9 | 17 | 25 | 121 | -1 | 1 |
1 | 22017 | 1610612745 | HOU | Houston Rockets | 21700002 | 2017-10-17 | HOU @ GSW | W | 240 | 47 | ... | 33 | 43 | 28 | 9 | 5 | 13 | 16 | 122 | 1 | 1 |
2 | 22017 | 1610612738 | BOS | Boston Celtics | 21700001 | 2017-10-17 | BOS @ CLE | L | 240 | 36 | ... | 37 | 46 | 24 | 11 | 4 | 12 | 24 | 99 | -3 | 1 |
3 | 22017 | 1610612739 | CLE | Cleveland Cavaliers | 21700001 | 2017-10-17 | CLE vs. BOS | W | 240 | 38 | ... | 41 | 50 | 19 | 3 | 4 | 17 | 25 | 102 | 3 | 1 |
4 | 22017 | 1610612750 | MIN | Minnesota Timberwolves | 21700011 | 2017-10-18 | MIN @ SAS | L | 240 | 37 | ... | 31 | 42 | 23 | 7 | 4 | 13 | 16 | 99 | -8 | 1 |
5 rows × 29 columns
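basketball = pd.read_csv("data/nba.csv")
# Each game appears twice in the raw data, once per team.
# Take the first-listed and last-listed entries for every game.
first_team = basketball.groupby("GAME_ID").first()
second_team = basketball.groupby("GAME_ID").last()
# Combine the two into one row per game; the second team's columns get an "_OPP" suffix.
games = first_team.merge(second_team, left_index = True, right_index = True, suffixes = ["", "_OPP"])
# Difference in field goal percentage: team minus opponent.
games['GOAL_DIFF'] = games["FG_PCT"] - games["FG_PCT_OPP"]
# Encode the outcome as 1 (win) or 0 (loss).
games['WON'] = (games['WL'] == "W").astype(int)
games = games[['TEAM_NAME', 'TEAM_NAME_OPP', 'MATCHUP', 'WON', 'WL', 'GOAL_DIFF']]
games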
GAME_ID | TEAM_NAME | TEAM_NAME_OPP | MATCHUP | WON | WL | GOAL_DIFF |
---|---|---|---|---|---|---|
21700001 | Boston Celtics | Cleveland Cavaliers | BOS @ CLE | 0 | L | -0.049 |
21700002 | Golden State Warriors | Houston Rockets | GSW vs. HOU | 0 | L | 0.053 |
21700003 | Charlotte Hornets | Detroit Pistons | CHA @ DET | 0 | L | -0.030 |
21700004 | Indiana Pacers | Brooklyn Nets | IND vs. BKN | 1 | W | 0.041 |
21700005 | Orlando Magic | Miami Heat | ORL vs. MIA | 1 | W | 0.042 |
... | ... | ... | ... | ... | ... | ... |
21701226 | New Orleans Pelicans | San Antonio Spurs | NOP vs. SAS | 1 | W | 0.189 |
21701227 | Oklahoma City Thunder | Memphis Grizzlies | OKC vs. MEM | 1 | W | 0.069 |
21701228 | LA Clippers | Los Angeles Lakers | LAC vs. LAL | 0 | L | 0.017 |
21701229 | Utah Jazz | Portland Trail Blazers | UTA @ POR | 0 | L | -0.090 |
21701230 | Houston Rockets | Sacramento Kings | HOU @ SAC | 0 | L | -0.097 |
1230 rows × 6 columns
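To see what the cleaning step does, here is a minimal sketch of the same pairing logic on a tiny, made-up table. The team names and percentages are hypothetical and do not come from data/nba.csv; only the column names mirror the real dataset.

```python
import pandas as pd

# Hypothetical toy data: two rows per game, one for each team.
toy = pd.DataFrame({
    "GAME_ID":   [1, 1, 2, 2],
    "TEAM_NAME": ["A", "B", "C", "D"],
    "WL":        ["W", "L", "L", "W"],
    "FG_PCT":    [0.50, 0.45, 0.40, 0.48],
})

# First-listed and last-listed team for each game.
first_team = toy.groupby("GAME_ID").first()
second_team = toy.groupby("GAME_ID").last()

# One row per game; the second team's columns are suffixed with "_OPP".
toy_games = first_team.merge(
    second_team, left_index=True, right_index=True, suffixes=("", "_OPP")
)

# Same derived columns as in the lecture's cleaning cell.
toy_games["GOAL_DIFF"] = toy_games["FG_PCT"] - toy_games["FG_PCT_OPP"]
toy_games["WON"] = (toy_games["WL"] == "W").astype(int)

print(toy_games[["TEAM_NAME", "TEAM_NAME_OPP", "WON", "GOAL_DIFF"]])
```

Each game ends up as a single row, viewed from the perspective of whichever team happened to be listed first, so "WON" and "GOAL_DIFF" both describe that first-listed team.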
Logistic Regression
If we visualize our data, we see a very different result from the scatter plots we have worked with for linear regression. Because a team can only win or lose a game, the only possible values of "WON" are 1 (if the team won the game) or 0 (if the team lost).
px.scatter(games,
           x="GOAL_DIFF", y="WON", color="WL",
           hover_data=['TEAM_NAME', 'TEAM_NAME_OPP'])
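Because "WON" only takes the values 0 and 1, its average over any group of games is simply the fraction of those games that were won. The sketch below illustrates this on a small, hypothetical stand-in for games; the numbers and the "higher"/"lower" labels are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the games table (not real NBA results).
toy_games = pd.DataFrame({
    "GOAL_DIFF": [-0.09, -0.03, -0.01, 0.02, 0.04, 0.10],
    "WON":       [0, 0, 1, 0, 1, 1],
})

# The response is binary: only 0 and 1 appear.
outcomes = sorted(toy_games["WON"].unique().tolist())

# The mean of a 0/1 column is the proportion of 1s, i.e. the win rate.
win_rate = toy_games["WON"].mean()

# Label each game by whether the team had the higher or lower field goal percentage.
side = toy_games["GOAL_DIFF"].gt(0).map({True: "higher", False: "lower"})

# Win rate within each group.
win_rate_by_side = toy_games.groupby(side)["WON"].mean()

print(outcomes)
print(win_rate)
print(win_rate_by_side)
```

In this toy table, teams with the higher field goal percentage win more often than teams with the lower one, which is the kind of relationship the scatter plot hints at even though every point sits at exactly 0 or 1.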