feat: spread predicter cli
BlairCurrey committed Feb 3, 2024
1 parent 4449fa4 commit dbb4a84
Showing 15 changed files with 841 additions and 574 deletions.
44 changes: 41 additions & 3 deletions README.md
@@ -56,6 +56,7 @@ Then I would train the model on all the games I can on a game-by-game basis. So
- could be useful for comparing the accuracy of my model. in particular "Distribution of the deviation of the final margin of victory from the Vegas spread"
- for example, perhaps avg spread difference between vegas and reality is ~10 so a model with an average difference of 8 would be good
- a concise little overview on features from a datascience.stackexchange comment about predicting matches (NOTE: not spread): https://datascience.stackexchange.com/questions/102827/how-to-predict-the-winner-of-a-future-sports-match
- article on when you need to scale data for ml: https://www.baeldung.com/cs/normalization-vs-standardization
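
A minimal sketch of the distinction that article covers, assuming scikit-learn; whether this project ends up normalizing or standardizing (or neither) is still an open question:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up feature columns, e.g. per-team averages on very different scales
X = np.array([[3.2, 110.0], [4.1, 95.0], [5.0, 130.0]])

# Normalization: rescale each column into [0, 1]
print(MinMaxScaler().fit_transform(X))

# Standardization: zero mean, unit variance per column
print(StandardScaler().fit_transform(X))
```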

# TODO:

@@ -89,6 +90,18 @@ score differential is wrong? look at first game. the number for the 2 teams dont
- gradient boosting (better with non-linear)?
- [x] some sort of basic analysis to see how it performed. including manually comparing to vegas spread (maybe I can find an average difference? https://www.theonlycolors.com/2020/9/29/21492301/vegas-always-knows-a-mathematical-deep-dive)
- 9-10 pt avg difference (?). a normal distribution means ~68% will be within 1 std deviation (identified as 14-15). could be a little lower because 1, 2, etc. are within 14-15, but could be higher because ~32% will be more than 14-15.
- [x] add function to create matchups from 2 teams so we can predict next week's games.
- using the running_avg df to merge, similar to how we're merging on game_id to get the final training df (see the sketch after this list)
- in practice the merged records should share a week but in theory they could be different (week 12 detroit vs. week 6 ravens etc.).
- [x] cli
- [x] download data
- [x] train model
- what to do with it? save configuration then recreate it when needed? pickle?
- [x] predict spread
- [ ] github workflow
- [ ] periodically update the data (and release?)
- [ ] periodically train the model (and release? what? the configuration... as what filetype? json?)
- [ ] periodically get upcoming games and make predictions. publish on github pages. get the bookie spread too?
- [ ] improve features/model. either at game aggregation level or team @ week aggregation level
- [ ] W/L record or games played and win pct? (win and loss column on game aggregation)
- [ ] success rate (calculate success (0 or 1) from each play).
@@ -97,9 +110,11 @@ score differential is wrong? look at first game. the number for the 2 teams dont
- [x] total points scored/allowed
- [ ] maybe don't use the first ~3 games? small sample size but don't want to throw out too much data.
- [ ] games played (could be used as confidence in record/stats)
- [x] add function to create matchups from 2 teams so we can predict next week's games.
- using the running_avg df to merge, similar to how we're merging the game_id to get the final training df
- in practice the merged records should share a week but in theory they could be different (week 12 detroit vs. week 6 ravens etc.).
- [ ] rethink exposing build_running_avg_dataframe, build_training_dataframe instead of doing that inside train_model (with the side effect of saving the running avg dataframe (to disk?) somewhere).
- just need to see how its actually used
- I guess it's good for development purposes? maybe just make the df arg in train_model(df) optional and build it from scratch when not provided (that path would be used in cli/deployment), while development can pass in a df? idk
- [ ] write script that gets upcoming games and makes prediction from model.
- try to find a good source for the schedule (nflfastR for that too maybe?).
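
A rough sketch of the matchup idea from the checklist item above, assuming a hypothetical `running_avg` dataframe with `team` and `week` columns plus per-team stat columns (e.g. `rushing_avg`, `passing_avg`); the function and column names here are illustrative, not the actual implementation:

```python
import pandas as pd

def build_matchup(running_avg: pd.DataFrame, home_team: str, away_team: str) -> pd.DataFrame:
    # Take each team's most recent week of running averages
    home = running_avg[running_avg["team"] == home_team].sort_values("week").tail(1)
    away = running_avg[running_avg["team"] == away_team].sort_values("week").tail(1)

    # Prefix the stat columns so one row holds both teams, matching the home_/away_ FEATURES
    stat_cols = [c for c in running_avg.columns if c not in ("team", "week")]
    home = home[stat_cols].add_prefix("home_").reset_index(drop=True)
    away = away[stat_cols].add_prefix("away_").reset_index(drop=True)

    return pd.concat([home, away], axis=1)
```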

# Current status:

@@ -123,6 +138,29 @@ score differential is wrong? look at first game. the number for the 2 teams dont

# Stray thoughts:

- model name idea: caliper. (like measuring the "spread")
- save model by pickling with joblib/dump or save the configuration like:

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Save essential components (assumes linreg - does it work the same for others?)
coefficients = model.coef_
intercept = model.intercept_
# assumes using MinMaxScaler (but maybe I'm not)
scaler_params = {'min_values': scaler.min_, 'scale_values': scaler.scale_}

# Recreate the model
recreated_model = LinearRegression()
recreated_model.coef_ = coefficients
recreated_model.intercept_ = intercept

# Recreate the scaler
recreated_scaler = MinMaxScaler()
recreated_scaler.min_ = scaler_params['min_values']
recreated_scaler.scale_ = scaler_params['scale_values']
```

- I think saving the configuration is probably better if I can.
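
If the configuration route wins out, a minimal sketch of what that could look like, reusing the LinearRegression/MinMaxScaler assumptions from the snippet above; the filename and JSON layout are made up for illustration:

```python
import json

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

def save_config(model, scaler, path="model_config.json"):
    config = {
        "coef": model.coef_.tolist(),
        "intercept": float(model.intercept_),
        "scaler_min": scaler.min_.tolist(),
        "scaler_scale": scaler.scale_.tolist(),
    }
    with open(path, "w") as f:
        json.dump(config, f)

def load_config(path="model_config.json"):
    with open(path) as f:
        config = json.load(f)
    model = LinearRegression()
    model.coef_ = np.array(config["coef"])
    model.intercept_ = config["intercept"]
    scaler = MinMaxScaler()
    scaler.min_ = np.array(config["scaler_min"])
    scaler.scale_ = np.array(config["scaler_scale"])
    return model, scaler
```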

- What should the model guess _exactly_, and what does that say about how the teams are modeled in the input? the spread consists of 2 numbers (usually the inverse of each other), 1 for each team. Maybe just predict the home team?
- probably need to squash 2 teams into 1 line like: home_team_pass_off, home_team_pass_def, away_team_pass_off, away_team_pass_def, etc. (see the sketch after this list)
- Are lots of features bad? What about redundant or mostly redundant features (pass yards, rush yards, total yards (total yards are either equal or very similar to pass+rush yards)). Which should I pick in that case (probably the less aggregated ones)?
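
A minimal sketch of the "just predict the home team" framing, assuming a one-row matchup dataframe with the home_/away_ columns from FEATURES in config.py and a scaler fitted on that same column order; purely illustrative:

```python
from nfl_analytics.config import FEATURES

def predict_home_margin(model, scaler, matchup_row):
    # matchup_row: one-row dataframe with the home_*/away_* columns listed in FEATURES
    X_scaled = scaler.transform(matchup_row[FEATURES])
    # Positive means the home team is predicted to win by that many points
    return float(model.predict(X_scaled)[0])
```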
Binary file added nfl_analytics/assets/trained_model.joblib
Binary file not shown.
Binary file added nfl_analytics/assets/trained_scaler.joblib
Binary file not shown.
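
The diff checks in trained_model.joblib and trained_scaler.joblib under nfl_analytics/assets. A minimal sketch of how such assets are typically written and read back with joblib (not necessarily the exact code in this commit; the stand-in data is made up):

```python
import os

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

from nfl_analytics.config import ASSET_DIR

# Stand-ins for the real fitted objects (the actual training lives elsewhere)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]])
y = np.array([3.0, 7.0, 10.0])
scaler = MinMaxScaler().fit(X)
model = LinearRegression().fit(scaler.transform(X), y)

# Save after training
os.makedirs(ASSET_DIR, exist_ok=True)
dump(model, os.path.join(ASSET_DIR, "trained_model.joblib"))
dump(scaler, os.path.join(ASSET_DIR, "trained_scaler.joblib"))

# Reload for prediction
model = load(os.path.join(ASSET_DIR, "trained_model.joblib"))
scaler = load(os.path.join(ASSET_DIR, "trained_scaler.joblib"))
```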
53 changes: 53 additions & 0 deletions nfl_analytics/config.py
@@ -0,0 +1,53 @@
DATA_DIR = "data"
ASSET_DIR = "assets"
START_YEAR = 1999
FEATURES = [
"away_rushing_avg",
"home_rushing_avg",
"away_passing_avg",
"home_passing_avg",
"away_sack_yards_avg",
"home_sack_yards_avg",
"away_score_differential_post_avg",
"home_score_differential_post_avg",
"away_points_scored_avg",
"home_points_scored_avg",
"away_points_allowed_avg",
"home_points_allowed_avg",
"away_mean_epa_avg",
"home_mean_epa_avg",
]
TEAMS = [
"WAS",
"ARI",
"BUF",
"NYJ",
"ATL",
"CAR",
"CIN",
"CLE",
"NYG",
"DAL",
"DET",
"KC",
"CHI",
"GB",
"BAL",
"HOU",
"IND",
"JAX",
"SEA",
"LA",
"LV",
"DEN",
"MIA",
"LAC",
"PHI",
"NE",
"PIT",
"SF",
"MIN",
"TB",
"NO",
"TEN",
]
90 changes: 52 additions & 38 deletions nfl_analytics/data.py
@@ -1,32 +1,65 @@
"""
Handles fetching and loading the play by play data. Essentially,
everything before transforming it.
"""

import urllib.request
from urllib.error import HTTPError
import os
import pandas as pd
import sqlite3

import pandas as pd

def get():
years = range(1999, 2024)
from nfl_analytics.config import DATA_DIR

save_directory = "data"
os.makedirs(save_directory, exist_ok=True)

def download_data(years=range(1999, 2024)):
os.makedirs(DATA_DIR, exist_ok=True)

for year in years:
# year gets parsed from this filename and depends on this format
filename = f"play_by_play_{year}.csv.gz"
url = f"https://github.com/nflverse/nflverse-data/releases/download/pbp/{filename}"
save_path = os.path.join(save_directory, filename)
save_path = os.path.join(DATA_DIR, filename)

print(f"Downloading {url}")
urllib.request.urlretrieve(url, save_path)

try:
urllib.request.urlretrieve(url, save_path)
except HTTPError as e:
print(
f"Error: Failed to download data for {year}. HTTP Error {e.code}: {e.reason}. Season for that year may not exist yet."
)


def load_pandas():
def load_dataframe():
script_dir = os.path.dirname(os.path.abspath(__file__))
data_directory = os.path.join(script_dir, "data")
data_directory = os.path.join(script_dir, DATA_DIR)

if not os.path.exists(data_directory):
raise FileNotFoundError(f"Data directory '{data_directory}' not found.")

files = os.listdir(data_directory)

if not files:
raise FileNotFoundError(f"No data files found in the data directory.")

# This won't pick up updated data (downloaded new data but still have combined, so it will use that)
# # load saved combined from disk if exists
# combined_file_path = os.path.join(
# data_directory, "combined", "play_by_play_combined.parquet.gzip"
# )
# if not skip_combined and os.path.exists(combined_file_path):
# print(f"Reading combined file {combined_file_path}")
# combined_df = pd.read_parquet(combined_file_path)
# return combined_df
# else:
# print("Combined file does not exist. Loading individual files.")

# make combined dataframe from individual files
combined_df = pd.DataFrame()

for filename in os.listdir(data_directory):
for filename in files:
if filename.endswith(".csv.gz"):
print(f"Reading {filename}")
file_path = os.path.join(data_directory, filename)
Expand All @@ -37,6 +70,9 @@ def load_pandas():
df["year"] = year
combined_df = pd.concat([combined_df, df], ignore_index=True)

if combined_df.empty:
raise FileNotFoundError("No data loaded from the files.")

return combined_df


Expand All @@ -46,43 +82,21 @@ def get_year_from_filename(filename):


def load_sqlite():
db_dir = "/tmp/nfl-analytics.db"
# load into pandas first and use to_sql to infer datatypes
df = load_pandas()
df = load_dataframe()

print(f"Loading into SQLite database: {db_dir}")

table_name = "plays"
db_conn = sqlite3.connect(database="/tmp/nfl-analytics.db")
# TODO: remove drop table after developing?
db_conn = sqlite3.connect(database=db_dir)
db_conn.execute(f"DROP TABLE IF EXISTS {table_name}")
df.to_sql(table_name, db_conn, index=False)

cursor = db_conn.execute(f"SELECT * from {table_name} LIMIT 10")
print(cursor.fetchall())


# def build():
# # TODO: do all the things the dev notebook is doing. splitting into nice functions as necessary
# # For example, could make a function for each time in notebook we are initializing a new dataframe (just a rough guide).
# pass


class Pipeline:
def __init__(self, debug=False):
self.debug = debug
# self.df = pd.DataFrame()

def _fetch_play_by_play(self, years=range(1999, 2024)):
pass

def _load(self):
pass

def _build(self):
pass

# def stuffthatbuildcalls (so I can run in the dev notebook)
# if debug: true, print stuff


if __name__ == "__main__":
get()
download_data()
load_sqlite()