feat: spread predicter cli
BlairCurrey committed Feb 3, 2024
1 parent 4449fa4 commit dbb4a84
Showing 15 changed files with 841 additions and 574 deletions.
44 changes: 41 additions & 3 deletions README.md
@@ -56,6 +56,7 @@ Then I would train the model on all the games I can on a game-by-game basis. So
- could be useful for comparing the accuracy of my model. in particular "Distribution of the deviation of the final margin of victory from the Vegas spread"
- for example, perhaps avg spread difference between vegas and reality is ~10 so a model with an average difference of 8 would be good
- a concise little overview on features from a datascience.stackexchange comment about predicting matches (NOTE: not spread): https://datascience.stackexchange.com/questions/102827/how-to-predict-the-winner-of-a-future-sports-match
- article on when you need to scale data for ml: https://www.baeldung.com/cs/normalization-vs-standardization
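
A minimal sketch of the distinction that article covers, assuming scikit-learn; whether this project ends up normalizing or standardizing (or neither) is still an open question:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up feature columns, e.g. per-team averages on very different scales
X = np.array([[3.2, 110.0], [4.1, 95.0], [5.0, 130.0]])

# Normalization: rescale each column into [0, 1]
print(MinMaxScaler().fit_transform(X))

# Standardization: zero mean, unit variance per column
print(StandardScaler().fit_transform(X))
```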

# TODO:

@@ -89,6 +90,18 @@ score differential is wrong? look at first game. the number for the 2 teams dont
- gradient boosting (better with non-linear)?
- [x] some sort of basic analysis to see how it performed. including manually comparing to vegas spread (maybe I can find an average difference? https://www.theonlycolors.com/2020/9/29/21492301/vegas-always-knows-a-mathematical-deep-dive)
- 9-10 pt avg difference (?). a normal distribution means ~68% will be within 1 std deviation (identified as 14-15). could be a little lower because 1, 2, etc. are within 14-15, but could be higher because ~32% will be more than 14-15.
- [x] add function to create matchups from 2 teams so we can predict next week's games.
- using the running_avg df to merge, similar to how we're merging on game_id to get the final training df (see the sketch after this list)
- in practice the merged records should share a week but in theory they could be different (week 12 detroit vs. week 6 ravens etc.).
- [x] cli
- [x] download data
- [x] train model
- what to do with it? save configuration then recreate it when needed? pickle?
- [x] predict spread
- [ ] github workflow
- [ ] periodically update the data (and release?)
- [ ] periodically train the model (and release? what? the configuration... as what filetype? json?)
- [ ] periodically get upcoming games and make predictions. publish on github pages. get the bookie spread too?
- [ ] improve features/model. either at game aggregation level or team @ week aggregation level
- [ ] W/L record or games played and win pct? (win and loss column on game aggregation)
- [ ] success rate (calculate success (0 or 1) from each play).
@@ -97,9 +110,11 @@ score differential is wrong? look at first game. the number for the 2 teams dont
- [x] total points scored/allowed
- [ ] maybe don't use the first ~3 games? small sample size but don't want to throw out too much data.
- [ ] games played (could be used as confidence in record/stats)
- [x] add function to create matchups from 2 teams so we can predict next week's games.
- using the running_avg df to merge, similar to how we're merging the game_id to get the final training df
- in practice the merged records should share a week but in theory they could be different (week 12 detroit vs. week 6 ravens etc.).
- [ ] rethink exposing build_running_avg_dataframe, build_training_dataframe instead of doing that inside train_model (with the side effect of saving the running avg dataframe (to disk?) somewhere).
- just need to see how its actually used
- I guess it's good for development purposes? maybe just make the df arg in train_model(df) optional and build it from scratch when not provided (that path would be used in cli/deployment), while development can pass in a df? idk
- [ ] write script that gets upcoming games and makes prediction from model.
- try to find a good source for the schedule (nflfastR for that too maybe?).
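
A rough sketch of the matchup idea from the checklist item above, assuming a hypothetical `running_avg` dataframe with `team` and `week` columns plus per-team stat columns (e.g. `rushing_avg`, `passing_avg`); the function and column names here are illustrative, not the actual implementation:

```python
import pandas as pd

def build_matchup(running_avg: pd.DataFrame, home_team: str, away_team: str) -> pd.DataFrame:
    # Take each team's most recent week of running averages
    home = running_avg[running_avg["team"] == home_team].sort_values("week").tail(1)
    away = running_avg[running_avg["team"] == away_team].sort_values("week").tail(1)

    # Prefix the stat columns so one row holds both teams, matching the home_/away_ FEATURES
    stat_cols = [c for c in running_avg.columns if c not in ("team", "week")]
    home = home[stat_cols].add_prefix("home_").reset_index(drop=True)
    away = away[stat_cols].add_prefix("away_").reset_index(drop=True)

    return pd.concat([home, away], axis=1)
```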

# Current status:

@@ -123,6 +138,29 @@ score differential is wrong? look at first game. the number for the 2 teams dont

# Stray thoughts:

- model name idea: caliper. (like measuring the "spread")
- save model by pickling with joblib/dump or save the configuration like:

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Save essential components (assumes linreg - does it work the same for others?)
coefficients = model.coef_
intercept = model.intercept_
# assumes using MinMaxScaler (but maybe I'm not)
scaler_params = {'min_values': scaler.min_, 'scale_values': scaler.scale_}

# Recreate the model
recreated_model = LinearRegression()
recreated_model.coef_ = coefficients
recreated_model.intercept_ = intercept

# Recreate the scaler
recreated_scaler = MinMaxScaler()
recreated_scaler.min_ = scaler_params['min_values']
recreated_scaler.scale_ = scaler_params['scale_values']
```

- I think saving the configuration is probably better if I can.
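
If the configuration route wins out, a minimal sketch of what that could look like, reusing the LinearRegression/MinMaxScaler assumptions from the snippet above; the filename and JSON layout are made up for illustration:

```python
import json

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

def save_config(model, scaler, path="model_config.json"):
    config = {
        "coef": model.coef_.tolist(),
        "intercept": float(model.intercept_),
        "scaler_min": scaler.min_.tolist(),
        "scaler_scale": scaler.scale_.tolist(),
    }
    with open(path, "w") as f:
        json.dump(config, f)

def load_config(path="model_config.json"):
    with open(path) as f:
        config = json.load(f)
    model = LinearRegression()
    model.coef_ = np.array(config["coef"])
    model.intercept_ = config["intercept"]
    scaler = MinMaxScaler()
    scaler.min_ = np.array(config["scaler_min"])
    scaler.scale_ = np.array(config["scaler_scale"])
    return model, scaler
```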

- What should the model guess _exactly_, and what does that say about how the teams are modeled in the input? the spread consists of 2 numbers (usually the inverse of each other), 1 for each team. Maybe just predict the home team?
- probably need to squash 2 teams into 1 line like: home_team_pass_off, home_team_pass_def, away_team_pass_off, away_team_pass_def, etc. (see the sketch after this list)
- Are lots of features bad? What about redundant or mostly redundant features (pass yards, rush yards, total yards (total yards are either equal or very similar to pass+rush yards)). Which should I pick in that case (probably the less aggregated ones)?
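
A minimal sketch of the "just predict the home team" framing, assuming a one-row matchup dataframe with the home_/away_ columns from FEATURES in config.py and a scaler fitted on that same column order; purely illustrative:

```python
from nfl_analytics.config import FEATURES

def predict_home_margin(model, scaler, matchup_row):
    # matchup_row: one-row dataframe with the home_*/away_* columns listed in FEATURES
    X_scaled = scaler.transform(matchup_row[FEATURES])
    # Positive means the home team is predicted to win by that many points
    return float(model.predict(X_scaled)[0])
```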
Binary file added nfl_analytics/assets/trained_model.joblib
Binary file not shown.
Binary file added nfl_analytics/assets/trained_scaler.joblib
Binary file not shown.
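
The diff checks in trained_model.joblib and trained_scaler.joblib under nfl_analytics/assets. A minimal sketch of how such assets are typically written and read back with joblib (not necessarily the exact code in this commit; the stand-in data is made up):

```python
import os

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

from nfl_analytics.config import ASSET_DIR

# Stand-ins for the real fitted objects (the actual training lives elsewhere)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]])
y = np.array([3.0, 7.0, 10.0])
scaler = MinMaxScaler().fit(X)
model = LinearRegression().fit(scaler.transform(X), y)

# Save after training
os.makedirs(ASSET_DIR, exist_ok=True)
dump(model, os.path.join(ASSET_DIR, "trained_model.joblib"))
dump(scaler, os.path.join(ASSET_DIR, "trained_scaler.joblib"))

# Reload for prediction
model = load(os.path.join(ASSET_DIR, "trained_model.joblib"))
scaler = load(os.path.join(ASSET_DIR, "trained_scaler.joblib"))
```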
53 changes: 53 additions & 0 deletions nfl_analytics/config.py
@@ -0,0 +1,53 @@
DATA_DIR = "data"
ASSET_DIR = "assets"
START_YEAR = 1999
FEATURES = [
"away_rushing_avg",
"home_rushing_avg",
"away_passing_avg",
"home_passing_avg",
"away_sack_yards_avg",
"home_sack_yards_avg",
"away_score_differential_post_avg",
"home_score_differential_post_avg",
"away_points_scored_avg",
"home_points_scored_avg",
"away_points_allowed_avg",
"home_points_allowed_avg",
"away_mean_epa_avg",
"home_mean_epa_avg",
]
TEAMS = [
"WAS",
"ARI",
"BUF",
"NYJ",
"ATL",
"CAR",
"CIN",
"CLE",
"NYG",
"DAL",
"DET",
"KC",
"CHI",
"GB",
"BAL",
"HOU",
"IND",
"JAX",
"SEA",
"LA",
"LV",
"DEN",
"MIA",
"LAC",
"PHI",
"NE",
"PIT",
"SF",
"MIN",
"TB",
"NO",
"TEN",
]
90 changes: 52 additions & 38 deletions nfl_analytics/data.py
@@ -1,32 +1,65 @@
"""
Handles fetching and loading the play by play data. Essentially,
everything before transforming it.
"""

import urllib.request
from urllib.error import HTTPError
import os
import pandas as pd
import sqlite3

import pandas as pd

def get():
years = range(1999, 2024)
from nfl_analytics.config import DATA_DIR

save_directory = "data"
os.makedirs(save_directory, exist_ok=True)

def download_data(years=range(1999, 2024)):
os.makedirs(DATA_DIR, exist_ok=True)

for year in years:
# year gets parsed from this filename and depends on this format
filename = f"play_by_play_{year}.csv.gz"
url = f"https://github.com/nflverse/nflverse-data/releases/download/pbp/{filename}"
save_path = os.path.join(save_directory, filename)
save_path = os.path.join(DATA_DIR, filename)

print(f"Downloading {url}")
urllib.request.urlretrieve(url, save_path)

try:
urllib.request.urlretrieve(url, save_path)
except HTTPError as e:
print(
f"Error: Failed to download data for {year}. HTTP Error {e.code}: {e.reason}. Season for that year may not exist yet."
)


def load_pandas():
def load_dataframe():
script_dir = os.path.dirname(os.path.abspath(__file__))
data_directory = os.path.join(script_dir, "data")
data_directory = os.path.join(script_dir, DATA_DIR)

if not os.path.exists(data_directory):
raise FileNotFoundError(f"Data directory '{data_directory}' not found.")

files = os.listdir(data_directory)

if not files:
raise FileNotFoundError(f"No data files found in the data directory.")

# This won't pick up updated data (downloaded new data but still have combined, so it will use that)
# # load saved combined from disk if exists
# combined_file_path = os.path.join(
# data_directory, "combined", "play_by_play_combined.parquet.gzip"
# )
# if not skip_combined and os.path.exists(combined_file_path):
# print(f"Reading combined file {combined_file_path}")
# combined_df = pd.read_parquet(combined_file_path)
# return combined_df
# else:
# print("Combined file does not exist. Loading individual files.")

# make combined dataframe from individual files
combined_df = pd.DataFrame()

for filename in os.listdir(data_directory):
for filename in files:
if filename.endswith(".csv.gz"):
print(f"Reading {filename}")
file_path = os.path.join(data_directory, filename)
Expand All @@ -37,6 +70,9 @@ def load_pandas():
df["year"] = year
combined_df = pd.concat([combined_df, df], ignore_index=True)

if combined_df.empty:
raise FileNotFoundError("No data loaded from the files.")

return combined_df


Expand All @@ -46,43 +82,21 @@ def get_year_from_filename(filename):


def load_sqlite():
db_dir = "/tmp/nfl-analytics.db"
# load into pandas first and use to_sql to infer datatypes
df = load_pandas()
df = load_dataframe()

print(f"Loading into SQLite database: {db_dir}")

table_name = "plays"
db_conn = sqlite3.connect(database="/tmp/nfl-analytics.db")
# TODO: remove drop table after developing?
db_conn = sqlite3.connect(database=db_dir)
db_conn.execute(f"DROP TABLE IF EXISTS {table_name}")
df.to_sql(table_name, db_conn, index=False)

cursor = db_conn.execute(f"SELECT * from {table_name} LIMIT 10")
print(cursor.fetchall())


# def build():
# # TODO: do all the things the dev notebook is doing. splitting into nice functions as necessary
# # For example, could make a function for each time in notebook we are initializing a new dataframe (just a rough guide).
# pass


class Pipeline:
def __init__(self, debug=False):
self.debug = debug
# self.df = pd.DataFrame()

def _fetch_play_by_play(self, years=range(1999, 2024)):
pass

def _load(self):
pass

def _build(self):
pass

# def stuffthatbuildcalls (so I can run in the dev notebook)
# if debug: true, print stuff


if __name__ == "__main__":
get()
download_data()
load_sqlite()