Commit

feat: produce working model
BlairCurrey committed Jan 29, 2024
1 parent 0c4c7de commit ec67c85
Showing 7 changed files with 4,973 additions and 1,325 deletions.
80 changes: 59 additions & 21 deletions README.md
Expand Up @@ -55,6 +55,7 @@ Then I would train the model on all the games I can on a game-by-game basis. So
- article on accuracy of vegas odds: https://www.theonlycolors.com/2020/9/29/21492301/vegas-always-knows-a-mathematical-deep-dive
- could be useful for comparing the accuracy of my model. in particular "Distribution of the deviation of the final margin of victory from the Vegas spread"
- for example, perhaps avg spread difference between vegas and reality is ~10 so a model with an average difference of 8 would be good
- a concise little overview of features from a datascience.stackexchange comment about predicting matches (NOTE: not spread): https://datascience.stackexchange.com/questions/102827/how-to-predict-the-winner-of-a-future-sports-match

# TODO:

Expand All @@ -68,43 +69,68 @@ Then I would train the model on all the games I can on a game-by-game basis. So
- [x] install ipykernel and use venv/kernel in notebook. can use this to get the venv location: `poetry show -v`
- [x] refactor data get/load into functions i can import.
- path error... relative paths not resolving when importing
- [ ] aggregate game stats from play by play for each game, for each season. keep the features simple for now.
- [x] aggregate game stats from play by play for each game, for each season. keep the features simple for now.
- [x] not sure if this should be dataframe or sqlite table maybe pandas but also sqlite for dev purposes (if still not using notebook).
- dataframe for sure to make it easier to develop.
- [ ] figure out how to model it
- Season dataframe with each row being a game. WeekNumber, Team1, Team2, Team1Score, Team2Score, Team1PassYards, Team2PassYards, etc.
- could be Home and Away team instead of 1,2
- not sure if squashing to 1 row is the best way to handle teams? Could have a separate row for each team per game (so two rows per game, 1 for each team)
- Say I want to get average team stats through week 10 for the bears. I have to group by either column at the same time? maybe difficult/impossible?
- I can always smash them into 1 record later for the training input too...
- I think season is too aggregated... i need individual games so i can accumulate them through a given number of weeks for any given season.
- [x] figure out how to model it
- Games dataframe with:
- Team, pass off, pass def, rush off, rush def, average EPA
- [ ] use aggregated team stats to get average stats for each team coming into each game. this will be
- should be able to query something like `"SELECT AVG(pass_yards) AS avg_pass_yards, etc. FROM Season WHERE (Team1 = 'CHI' OR Team2 = 'CHI') AND week_number < 10"`
- this would be the input to the model to predict the week 10 game for the chicago bears
- this is going to be a lot of duplicate aggregation if I do this for each week. there may be a good way to reuse previously computed averages, e.g. storing totals and counts for each stat in a separate table and then computing averages from that table.
- Team, pass off, pass def, rush off, rush def, average EPA, score spread
- [x] use aggregated team stats to get average stats for each team coming into each game. this will be

score differential is wrong? look at the first game. the numbers for the 2 teams don't match

- [ ] cleanup
- [ ] the get_data and load_data is duplicated in data.py and get_data.py/load_data.py. just use one or the other.
- [ ] move notebook code to python files. think about a manageable way to share logic between notebook and python files so I can drop into the pipeline and inspect as needed.
- probably just put everything in functions that are imported into the python file and notebook?
- [ ] simple model to predict spread
- [ ] use sklearn to train model (linear regression, although maybe not ideal because relationship between features (pass off vs. opponent pass def))
- [ ] some sort of basic analysis to see how it performed. including manually comparing to vegas spread (maybe I can find an average difference? https://www.theonlycolors.com/2020/9/29/21492301/vegas-always-knows-a-mathematical-deep-dive)
- [ ] improve features. either at game aggregation level or team @ week aggregation level
- W/L record or games played and win pct?
- success rate (calculate success (0 or 1) from each play).
- [x] simple model to predict spread
- [x] use sklearn to train model
- could do linear regression, although maybe not ideal because relationship between features (pass off vs. opponent pass def)
- gradient boosting (better with non-linear)?
- [x] some sort of basic analysis to see how it performed. including manually comparing to vegas spread (maybe I can find an average difference? https://www.theonlycolors.com/2020/9/29/21492301/vegas-always-knows-a-mathematical-deep-dive)
- 9-10 pt avg difference (?). a normal distribution means ~68% will be within 1 std deviation (identified as 14-15). could be a little lower because 1,2, etc. are within 14-15, but could be higher because ~32% will be more than 14-15.
- [ ] improve features/model. either at game aggregation level or team @ week aggregation level
- [ ] W/L record or games played and win pct? (win and loss column on game aggregation)
- [ ] success rate (calculate success (0 or 1) from each play).
- could be measured from positive EPA. or 40% of yards to go on 1st, 70% on 2nd, 100% on 3rd/4th (or similar)
- total points scored/allowed
- actually it looks like there are some success rate fields?
- [x] total points scored/allowed
- [ ] maybe don't use first ~3 games? small sample size but don't want to throw out too much data.
- [ ] games played (could be used as confidence in record/stats)
- [x] add function to create matchups from 2 teams so we can predict next week's games.
- using the running_avg df to merge, similar to how we're merging the game_id to get the final training df
- in practice the merged records should share a week but in theory they could be different (week 12 detroit vs. week 6 ravens etc.).
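
The success-rate rule from the TODO list (40% of yards to go on 1st down, 70% on 2nd, 100% on 3rd/4th) could be sketched roughly as follows. This is only an illustration: `is_success`, `success_rate` and the tuple layout are hypothetical names, not nflfastR fields, and the dataset may already ship its own success column that makes this unnecessary.

```python
# Assumed thresholds, taken from the note in the TODO list: percentage of the
# yards to go that a play must gain to count as a success, keyed by down.
SUCCESS_THRESHOLD_PCT = {1: 40, 2: 70, 3: 100, 4: 100}


def is_success(down: int, yards_to_go: int, yards_gained: int) -> int:
    """Return 1 if the play gained the required share of yards to go, else 0."""
    # Integer arithmetic avoids float rounding at the exact threshold.
    return int(yards_gained * 100 >= SUCCESS_THRESHOLD_PCT[down] * yards_to_go)


def success_rate(plays: list[tuple[int, int, int]]) -> float:
    """Average success over (down, yards_to_go, yards_gained) tuples."""
    if not plays:
        return 0.0
    return sum(is_success(*play) for play in plays) / len(plays)


# Made-up plays: two successes out of four.
example_plays = [(1, 10, 4), (2, 10, 6), (3, 2, 2), (4, 1, 0)]
example_rate = success_rate(example_plays)
```

The per-play 0/1 value can then be averaged per team per game in the same way as the other game-level aggregates.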

# Current status:

- project managed by poetry
- Can fetch all play by play data in compressed csv format from `nflfastR` releases using `get_data.py`
- Can load raw data into pandas dataframe and sqlite3 using load_data.py. data types are inferred by pandas and sqlite is created from pandas method. this is key because there are a lot of columns (372 at time of writing). When identifying actual columns I want to use I can get more specific with data types.
- Aggregates basic pbp stats into game stats
- Aggregates basic game stats into rolling averages for each team/year
- Aggregates rolling weekly averages by team
- trains linear regression model to predict spread
- can get matchup from rolling averages for future matches and predict spread
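
One way to do the "basic analysis" mentioned in the TODO is to compare the model's average miss with the Vegas line's average miss on the same games. The sketch below uses made-up numbers purely for illustration and assumes every value is expressed as the home team's margin; `mean_abs_error` is a hypothetical helper, not part of this repo.

```python
from statistics import mean


def mean_abs_error(predicted: list[float], actual: list[float]) -> float:
    """Average absolute difference between predicted and actual margins."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return mean(abs(p - a) for p, a in zip(predicted, actual))


# Illustrative, invented values (home-team margin, in points).
actual_margin = [3, -7, 10, -14, 1]
vegas_spread = [6.5, -3, 3, -7, -2.5]
model_spread = [4, -10, 2, -9, 3]

vegas_mae = mean_abs_error(vegas_spread, actual_margin)
model_mae = mean_abs_error(model_spread, actual_margin)
print(f"vegas MAE: {vegas_mae:.1f}, model MAE: {model_mae:.1f}")
```

A model whose error is consistently at or below the Vegas figure over a full held-out season would be the interesting result; five games say nothing.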

# Next steps (beyond TODO):

- "Blogpost" (or github md readme.)
- Possible change of direction: multilabel classifier to predict if a matchup is normal or an outlier if overdog wins by a lot more than expected or underdog performs much better than expected. Then use that to pick the over/underdog to cover the spread (or to pass on the game)
- Not sure of labels (overdog_performance: normal, underperform, overperform).
- Find games where spread (included in dataset) is off and then train a model to classify them. Input can be the same team stats as the spread predictor, but the dataset will be limited to just the games that are off.
- outlier definition tbd. perhaps abs(real spread - bookie spread) is > 1 std deviation? which is 14-15 points according to some article I found (in resource section). In that case I'd expect ~68% of games to be "normal" and ~32% to be outliers.
- Crazy idea: After implementing, go back and use this as a feature on the spread. Classify each game as (normal,outperform,underperform) and use that as a feature to train with. Perhaps this will be useful even if the accuracy is somewhat low?
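
A rough sketch of the outlier labelling described above, assuming the spread error is normally distributed around zero with a standard deviation of 14.5 points (the midpoint of the 14-15 figure, which is itself an assumption). Margins are taken from the favorite's point of view, and `label_game` is a hypothetical name.

```python
from math import pi, sqrt
from statistics import NormalDist

SPREAD_ERROR_STD = 14.5  # assumption: midpoint of the 14-15 point figure


def label_game(actual_margin: float, bookie_margin: float,
               threshold: float = SPREAD_ERROR_STD) -> str:
    """Label the favorite's performance relative to the bookie spread."""
    error = actual_margin - bookie_margin
    if error > threshold:
        return "overperform"
    if error < -threshold:
        return "underperform"
    return "normal"


# Share of games expected to land within one standard deviation.
error_dist = NormalDist(mu=0, sigma=SPREAD_ERROR_STD)
share_normal = error_dist.cdf(SPREAD_ERROR_STD) - error_dist.cdf(-SPREAD_ERROR_STD)

# For a zero-mean normal, the expected absolute error is sigma * sqrt(2 / pi).
expected_abs_error = SPREAD_ERROR_STD * sqrt(2 / pi)
```

Under these assumptions `share_normal` comes out near 68%, and the two outlier labels split the remaining ~32% between them. The expected absolute error works out to about 11.6 points, somewhat above the 9-10 point average noted in the TODO, so the inputs should be treated as approximate.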

# Stray thoughts:

- What should the model guess _exactly_ and what does that say about how the teams are modeled in the input? the spread consists of 2 numbers (usually the inverse of each other). 1 for each team. Maybe just predict the hometeam?
- probably need to squash 2 teams into 1 line like: home_team_pass_off, home_team_pass_def, away_team_pass_off, away_team_pass_def, etc.
- Are lots of features bad? What about redundant or mostly redundant features (pass yards, rush yards, total yards (total yards are either equal or very similar to pass+rush yards)). Which should I pick in that case (probably the less aggregated ones)?

# Modeling ideas

## Running Averages By Week

Perhaps something like this for aggregating stats

Bears pass for 170 week 1, and 220 week 2. pass_off_running_avg is the running average, so their average pass yards/game coming into week 3 is 195. This would be the input to the model for predicting the week 3 game along with the bookie spread (and other team).
Expand All @@ -119,7 +145,19 @@ Bears pass for 170 week 1, and 220 week 2. pass_off_running_avg is the running a
| 2023 | 3 | CHI | 195 | |
| etc... | | | | |

Then I could do SELECT pass_off_running_avg FROM Season WHERE Team = 'CHI' AND week = 3
Then I could do something like SELECT pass_off_running_avg FROM Season WHERE Team = 'CHI' AND week = 3 to get the averages for a given week.
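
The running-average table and the query above can be tried end to end with the standard library. This is a minimal sketch, not the notebook's pandas code: `running_averages` is a hypothetical helper, the table layout is assumed from the query, and it naively treats `week + 1` as the next game, which ignores bye weeks. Keeping a running total and count per team avoids re-aggregating every earlier week.

```python
import sqlite3
from collections import defaultdict

# (season, week, team, pass_yards): the Bears example from the text above.
games = [
    (2023, 1, "CHI", 170),
    (2023, 2, "CHI", 220),
]


def running_averages(rows):
    """Return (season, week, team, avg) rows, where avg covers games before `week`."""
    totals = defaultdict(lambda: [0, 0])  # (season, team) -> [yards, games]
    out = []
    for season, week, team, yards in sorted(rows):
        total = totals[(season, team)]
        total[0] += yards
        total[1] += 1
        # Average coming into the following week (naive: no bye-week handling).
        out.append((season, week + 1, team, total[0] / total[1]))
    return out


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Season "
    "(season INTEGER, week INTEGER, Team TEXT, pass_off_running_avg REAL)"
)
conn.executemany("INSERT INTO Season VALUES (?, ?, ?, ?)", running_averages(games))

row = conn.execute(
    "SELECT pass_off_running_avg FROM Season WHERE Team = ? AND week = ?",
    ("CHI", 3),
).fetchone()
print(row[0])  # average pass yards coming into week 3
```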

## Final input

Maybe something like this where:

- spread is for home team
- everything on 1 line with home and away team stats (running average from season start)
- maybe even exclude home/away team? shouldn't really factor them in, right? would home field advantage be a factor here?
GameID,Date,HomeTeam,AwayTeam,HomePassOff,AwayPassOff,HomeRushOff,AwayRushOff,HomePassDef,AwayPassDef,HomeRushDef,AwayRushDef,Spread
1,2022-09-10,TeamA,TeamB,250,220,120,100,180,200,90,110,3
2,2022-09-10,TeamC,TeamD,280,240,130,110,200,180,95,100,-7
3,2022-09-11,TeamE,TeamF,220,200,110,90,170,190,85,120,5
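
Squashing two teams' running averages into one home/away row, as in the CSV above, might look like this. The notes say the real implementation merges a `running_avg` dataframe; this sketch uses plain dicts instead, and `build_matchup`, `STATS` and the sample values (copied from the first CSV row) are only illustrative.

```python
# Hypothetical running averages for each team coming into the game.
running_avg = {
    "TeamA": {"PassOff": 250, "RushOff": 120, "PassDef": 180, "RushDef": 90},
    "TeamB": {"PassOff": 220, "RushOff": 100, "PassDef": 200, "RushDef": 110},
}

STATS = ["PassOff", "RushOff", "PassDef", "RushDef"]


def build_matchup(home: str, away: str, averages: dict) -> dict:
    """Combine two teams' averages into a single model-input row."""
    row = {"HomeTeam": home, "AwayTeam": away}
    for stat in STATS:
        row[f"Home{stat}"] = averages[home][stat]
        row[f"Away{stat}"] = averages[away][stat]
    return row


matchup = build_matchup("TeamA", "TeamB", running_avg)
print(",".join(str(value) for value in matchup.values()))
```

The `Spread` column is left out on purpose: for a future game it is the value the model is supposed to predict.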

# possible spread -> win pct map

Expand Down
24 changes: 24 additions & 0 deletions nfl_analytics/data.py
Expand Up @@ -59,6 +59,30 @@ def load_sqlite():
print(cursor.fetchall())


# def build():
#     # TODO: do all the things the dev notebook is doing. splitting into nice functions as necessary
#     # For example, could make a function for each time in notebook we are initializing a new dataframe (just a rough guide).
#     pass


class Pipeline:
    def __init__(self, debug=False):
        self.debug = debug
        # self.df = pd.DataFrame()

    def _fetch_play_by_play(self, years=range(1999, 2024)):
        pass

    def _load(self):
        pass

    def _build(self):
        pass

    # def stuffthatbuildcalls (so I can run in the dev notebook)
    # if debug: true, print stuff


if __name__ == "__main__":
    get()
    load_sqlite()