feat: produce working model

BlairCurrey · Jan 29, 2024 · ec67c85 · ec67c85
1 parent 0c4c7de
commit ec67c85
Show file tree

Hide file tree

Showing 7 changed files with 4,973 additions and 1,325 deletions.
diff --git a/README.md b/README.md
@@ -55,6 +55,7 @@ Then I would train the model on all the games I can on a game-by-game basis. So
 - article on accuracy of vegas odds: https://www.theonlycolors.com/2020/9/29/21492301/vegas-always-knows-a-mathematical-deep-dive
   - could be useful for comparing the accuracy of my model. in particular "Distribution of the deviation of the final margin of victory from the Vegas spread"
   - for example, perhaps avg spread difference between vegas and reality is ~10 so a model with an average difference of 8 would be good
+- a concise little overview on features from a datascience.exchange comment about predicting matches (NOTE: not spread): https://datascience.stackexchange.com/questions/102827/how-to-predict-the-winner-of-a-future-sports-match
 
 # TODO:
 
@@ -68,43 +69,68 @@ Then I would train the model on all the games I can on a game-by-game basis. So
   - [x] install ipykernel and use venv/kernal in notebook. can use this get venv location `poetry show -v`
   - [x] refactor data get/load into functions i can import.
     - path error... relative paths not resolving when importing
-- [ ] aggregate game stats from play by play for each game, for each season. keep the features simple for now.
+- [x] aggregate game stats from play by play for each game, for each season. keep the features simple for now.
   - [x] not sure if this should be dataframe or sqlite table maybe pandas but also sqlite for dev purposes (if still not using notebook).
     - dataframe for sure to make it easier to develop.
-  - [ ] figure out how to model it
-    - Season dataframe with each row being a game. WeekNumber, Team1, Team2, Team1Score, Team2Score, Team1PassYards, Team2PassYards, etc.
-      - could be Home and Away team instead of 1,2
-      - not sure if squashing to 1 row is the best way to handle teams? Could have seperate row for each team per game (so two rows per game, 1 for each team)
-        - Say I want to get average team stats through week 10 for the bears. I have to group by either column at the same time? maybe difficult/impossible?
-        - I can always smash them into 1 record later for the training input too...
-      - I think season is too aggregated... i need individual games so i can accumulate them through a given number of weeks for any given season.
+  - [x] figure out how to model it
     - Games dataframe with:
-      - Team, pass off, pass def, rush off, rush def, average EPA
-- [ ] use aggregated team stats to get average stats for each team coming into each game. this will be
-  - should be able to use to query something like `"SELECT AVERAGE(pass_yards) as avg_pass_yards, etc. FROM Season where Team1 = 'CHI' or Team2 = 'CHI' and week_number < 10"`
-    - this would be the input to the model to predict the week 10 game for the chicago bears
-    - this is going to be a lot of duplicate aggregation if I do this for each week. may be a good way to use previously computed averages. maybe storing totals and counts for each stat in a separate table and then computing averages from that table.
+      - Team, pass off, pass def, rush off, rush def, average EPA, score spread
+- [x] use aggregated team stats to get average stats for each team coming into each game. this will be
+
+score differential is wrong? look at first game. the number for the 2 teams dont match
+
 - [ ] cleanup
   - [ ] the get_data and load_data is duplicated in data.py and get_data.py/load_data.py. just use one or the other.
   - [ ] move notebook code to python files. think about a managable way to share logic between notebook and python files so I can drop into the pipeline and inspect as needed.
     - probably just put everything in a functions that are imported into the python file and notebook?
-- [ ] simple model to predict spread
-  - [ ] use sklearn to train model (linear regression, although maybe not ideal because relationship between features (pass off vs. opponent pass def))
-  - [ ] some sort of basic analysis to see how it performed. including manually comparing to vegas spread (maybe I can find an average difference? https://www.theonlycolors.com/2020/9/29/21492301/vegas-always-knows-a-mathematical-deep-dive)
-- [ ] improve features. either at game aggregation level or team @ week aggregation level
-  - W/L record or games played and win pct?
-  - success rate (calculate success (0 or 1) from each play).
+- [x] simple model to predict spread
+  - [x] use sklearn to train model
+    - could do linear regression, although maybe not ideal because relationship between features (pass off vs. opponent pass def)
+    - gradient boosting (better with non-linear)?
+  - [x] some sort of basic analysis to see how it performed. including manually comparing to vegas spread (maybe I can find an average difference? https://www.theonlycolors.com/2020/9/29/21492301/vegas-always-knows-a-mathematical-deep-dive)
+    - 9-10 pt avg difference (?). normaly distrubution means ~68% will be within 1 std deviation (identified as 14-15). could be a little lower because 1,2, etc. are within 14-15, but could be higher because ~32% will be more than 14-15.
+- [ ] improve features/model. either at game aggregation level or team @ week aggregation level
+  - [ ] W/L record or games played and win pct? (win and loss column on game aggregation)
+  - [ ] success rate (calculate success (0 or 1) from each play).
     - could be measured from positive EPA. or 40% of yards to go on 1st, 70% on 2nd, 100% on 3rd/4th (or similar)
-  - total points scored/allowed
+    - actually it looks like there are some success rate fields?
+  - [x] total points scored/allowed
+  - [ ] maybe dont use first ~3 games? small sample size but dont want to throw out too much data.
+  - [ ] games played (could be used as confidence in record/stats)
+- [x] add function to create matchups from 2 teams so we can predict next week's games.
+  - using the running_avg df to merge, similar to how we're merging the game_id to get the final training df
+  - in practice the merged records should share a week but in theory they could be different (week 12 detroit vs. week 6 ravens etc.).
 
 # Current status:
 
 - project managed by poetry
 - Can fetch all play by play data in compressed csv format from ` nflfastR`` releases using  `get_data.py``
 - Can load raw data into pandas dataframe and sqlite3 using load_data.py. data types are inferred by pandas and sqlite is created from pandas method. this is key because there are a lot of columns (372 at time of writing). When identifying actual columns I want to use I can get more specific with data types.
+- Aggregates basic pbp stats into game stats
+- Aggregates basic game stats into rollwing averages for each team/year
+- Aggregates rolling weekly averages by team
+- trains linear regression model to predict spread
+- can get matchup from rolling averages for future matches and predict spread
+
+# Next steps (beyond TODO):
+
+- "Blogpost" (or github md readme.)
+- Possible change of direction: multilabel classifer to predict if a matchup is normal or an outlier if overdog wins by a lot more than expected or underdog performs much better than expected. Then use that to pick the over/underdog to cover the spread (or to pass on the game)
+  - Not sure of labels (overdog_performance: normal, underperform, overperform).
+  - Find games where spread (included in dataset) is off and then train a model to classify them. Input can be the same team stats as the spread predictor, but the dataset will be limited to just the games that are off.
+  - outlier definition tbd. perhaps abs(real spread - booky spread) is > 1 std deviation? which is 14-15 points according to some article I found (in resource section). In that case I'd expect ~68% of games to be "normal" and ~32% to be outliers.
+  - Crazy idea: After implmeneting, go back and use this as a feature on the spread. Classify each games as (normal,outperform,underperform) and use that as a feature to train with. Perhaps this will be useful even if the accuracy is somewhat low?
+
+# Stray thoughts:
+
+- What should the model's be guess _exactly_ and what does that say about how the teams are modeled in the input? the spread consists of 2 numbers (usually the inverse of each). 1 for each team. Maybe just predict the hometeam?
+  - probably need to squash 2 teams into 1 line like: home_team_pass_off, home_team_pass_def, away_team_pass_off, away_team_pass_def, etc.
+- Are lots of features bad? What about redundant or mostly redundant features (pass yards, rush yards, total yards (total yards are either equal or very similar to pass+rush yards)). Which should I pick in that case (probably the less aggregated ones)?
 
 # Modeling ideas
 
+## Running Averages By Week
+
 Perhaps somethinglike this for aggregating stats
 
 Bears pass for 170 week 1, and 220 week 2. pass_off_running_avg is the running average, so their average pass yards/game coming into week 3 is 195. This would be the input to the model for predicting the week 3 game along with the bookie spread (and other team).
@@ -119,7 +145,19 @@ Bears pass for 170 week 1, and 220 week 2. pass_off_running_avg is the running a
 | 2023   | 3    | CHI  | 195                  |        |
 | etc... |      |      |                      |        |
 
-Then I could do SELECT pass_off_running_avg FROM Season where Team = 'CHI' week = 3
+Then I could do something like SELECT pass_off_running_avg FROM Season where Team = 'CHI' week = 3 to get the averages for a given week.
+
+## Final input
+
+Maybe something like this where:
+
+- spread is for home team
+- everything on 1 line with home and away team stats (running average from season start)
+- maybe even exclude home/away team? shouldnt really factor them in right? would home field advantage be a factor here?
+  GameID,Date,HomeTeam,AwayTeam,HomePassOff,AwayPassOff,HomeRushOff,AwayRushOff,HomePassDef,AwayPassDef,HomeRushDef,AwayRushDef,Spread
+  1,2022-09-10,TeamA,TeamB,250,220,120,100,180,200,90,110,3
+  2,2022-09-10,TeamC,TeamD,280,240,130,110,200,180,95,100,-7
+  3,2022-09-11,TeamE,TeamF,220,200,110,90,170,190,85,120,5
 
 # possible spread -> win pct map
 

diff --git a/nfl_analytics/data.py b/nfl_analytics/data.py
@@ -59,6 +59,30 @@ def load_sqlite():
     print(cursor.fetchall())
 
 
+# def build():
+#     # TODO: do all the things the dev notebook is doing. splitting into nice functions as necessary
+#     # For example, could make a function for each time in notebook we are initializing a new dataframe (just a rough guide).
+#     pass
+
+
+class Pipeline:
+    def __init__(self, debug=False):
+        self.debug = debug
+        # self.df = pd.DataFrame()
+
+    def _fetch_play_by_play(self, years=range(1999, 2024)):
+        pass
+
+    def _load(self):
+        pass
+
+    def _build(self):
+        pass
+
+    # def stuffthatbuildcalls (so I can run in the dev notebook)
+    # if debug: true, print stuff
+
+
 if __name__ == "__main__":
     get()
     load_sqlite()