
Last Game stats via balldontlie.io by Nealmick #356

Open
wants to merge 18 commits into master

Conversation

@nealmick commented Nov 12, 2023

These additions are designed to have minimal impact on the project and work as an extension, optionally adding data on top for both historical and future games.
Currently this is only implemented for the xgb ml model but could very easily work for the rest of the models as well.
Added a -lg command line flag for both training and predicting with the xgb ml models.
train:
python -m XGBoost_Model_ML -lg
predict:
python3 main.py -xgb -odds=fanduel -lg

For this to work you must first train the model, then make sure the xgb runner is set to load the new model weights.

This does not read from or write to the sqlite db; it simply takes a dataframe, augments it with the new data, and returns it. If -lg is not specified there should be no meaningful changes to any functionality.

How is the data collected?
We use the game date and team names to create an api call to balldontlie.io, a free open source project.

url = 'https://www.balldontlie.io/api/v1/games?start_date=' + one_month_before_game + '&end_date=' + one_day_before_game + '&&team_ids[]=' + str(team_id) + '&per_page=100'

This returns the team's games in a date range from one month before the game up to one day before it.
The script loops over those games, finds the most recent one, and sets it as the last game id.
Once we know a team's last game we request all of the player stats from that game.
We then loop over the player stats and find the 5 players with the most playtime for each team.
After adding the last-game stats for all 20 players, plus the home and visitor scores for both historical games, we should be at 453 input parameters: 17 player stats * 5 players per team * 4 teams = 340, plus the 113 original data points before dropping.

Each player has these stats:
labels = ['ast','blk','dreb','fg3_pct','fg3a','fg3m','fga','fgm','fta','ftm','oreb','pf','pts','reb','stl', 'turnover', 'min']
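
To make the flow above concrete, here is a minimal sketch of the lookup, not the exact code in this branch; the /stats endpoint shape, the string format of min, and the helper names are assumptions based on the v1 balldontlie API this PR used:

```python
import requests

def minutes_played(raw):
    # 'min' comes back as a string like "34" or "34:12"; treat missing values as 0
    if not raw:
        return 0.0
    parts = str(raw).split(':')
    return float(parts[0]) + (float(parts[1]) / 60.0 if len(parts) > 1 else 0.0)

def last_game_top_players(team_id, one_month_before_game, one_day_before_game):
    games_url = ('https://www.balldontlie.io/api/v1/games?start_date=' + one_month_before_game
                 + '&end_date=' + one_day_before_game + '&team_ids[]=' + str(team_id) + '&per_page=100')
    games = requests.get(games_url).json()['data']
    last_game = max(games, key=lambda g: g['date'])  # most recent game before the target date

    stats_url = ('https://www.balldontlie.io/api/v1/stats?game_ids[]='
                 + str(last_game['id']) + '&per_page=100')
    stats = requests.get(stats_url).json()['data']

    # keep the 5 players with the most playtime for this team
    team_stats = [s for s in stats if s['team']['id'] == team_id]
    team_stats.sort(key=lambda s: minutes_played(s['min']), reverse=True)
    return last_game, team_stats[:5]
```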

Why last game?
It offers momentum-based data, similar to streaks. You already have a lot of data that is normalized over a longer period; short-term data, I think, could improve accuracy. This also helps with injuries: if a star player has been injured, they won't be in the last game. It also completely removes the need to deal with rosters. We never have to worry about which players are on which team, we just get all of the players from the last game in a single api call.

Took me a few days to put this together, and I'm certain there are still a few bugs. Let me know if you see an issue or have any suggestions!

This heavily uses pickle objects to save and load data, and it's super fast. After running the xgb ml training with -lg you can uncomment this on line 250:
#return load_obj('res')#uncomment here to load from cache
and future training will load from cache instead of recalculating every row.
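
For reference, the cache helpers boil down to something like the sketch below; load_obj is the name used above, while save_obj and the .pkl path convention are assumptions about the branch:

```python
import pickle

def save_obj(obj, name):
    # serialize any Python object (e.g. the prepared training dataframe) to disk
    with open(name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name):
    # load a previously cached object back into memory, skipping recomputation
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)
```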

There is probably a lot more I'm forgetting, but it's a work in progress.

@nealmick (Author)

Already think I found a bug or two haha, so just bear with me, gonna update this over the next few days.

@kyleskom (Owner)

Mark this as a draft PR for now and update it to ready for review when it's all set!

@nealmick nealmick marked this pull request as draft November 12, 2023 23:18
@nealmick (Author)

Sounds good, gonna take a few days to get this mint and work out any issues.

@nealmick (Author) commented Nov 13, 2023

Fixed a few things and added the over/under model. This does create some complexity: it's now possible to have only one model trained with lg data, which would cause issues if lg data were passed to both OU and ML. We need to be able to mix and match and detect how many input features each model expects. I added a model attribute which saves the number of parameters when it's trained:
model.set_attr(num_features=str(data.shape[1]))

This way, when we load the model back up, we can detect how many parameters the model should have. That raises the question of why even keep the lg flag. While the code can detect the parameters and work without a cli flag, I've still left it so it works on top: lg data is only added if the model was trained with the correct number of parameters and -lg is specified. That gives a way to bypass the logic and turn it off, as well as to mix and match models. One side effect: if your setup runs the default models and you specify -lg, it won't error where you might expect it to. Even though -lg was specified, the parameter count (or lack thereof) of the default model is detected, so no lg data is added, where before it would error about the wrong number of columns.
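
Roughly, the attribute round-trip looks like the sketch below; the file path, the args.lg flag, and lg_feature_count are illustrative placeholders, not the exact names in the branch:

```python
import xgboost as xgb

# training side: record the input width on the booster itself
# (params, dtrain and the training dataframe `data` are built elsewhere in the trainer)
model = xgb.train(params, dtrain, num_boost_round=300)
model.set_attr(num_features=str(data.shape[1]))
model.save_model('Models/XGBoost_example_ML.json')   # illustrative path

# prediction side: read the width back and only augment with lg data if it fits
booster = xgb.Booster()
booster.load_model('Models/XGBoost_example_ML.json')
trained_width = int(booster.attr('num_features') or 0)
add_lg_data = args.lg and trained_width == lg_feature_count  # lg_feature_count: expected width with lg columns
```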

It's getting better, but I was super busy today. It would be sweet to have a few buttons on the flask app to trigger training or switch models...

I think all the data is being added properly now, and the major functions are mostly working. Mainly just testing and adjusting the surrounding controls. I still should document it and really go through and make sure things are done clearly. I want to add some flask stuff too but I'm gonna hold off on that for now.

@nealmick (Author)

Cleaned up and added more comments. Added a flask route: http://127.0.0.1:5000/lg will run from flask with the -lg flag, currently set up only with fanduel. Gonna mark this as ready for review; there are a lot of changes here so I'm fully open to any suggestions or fixes!

@nealmick nealmick marked this pull request as ready for review November 14, 2023 23:03
@nealmick (Author)

Been monitoring over the past week, seems to be doing well. Haven't seen any new issues so far.

@kyleskom (Owner)

Going to review this soon!

@nkashyap14

(screenshot attached)

What I see when I serve the lg route on the flask app.

@nealmick (Author)

"currently setup only with fanduel"
Plus it detects the input parameter length, if you haven't set a custom lg model, it will load default data and default model.

https://github.com/kyleskom/NBA-Machine-Learning-Sports-Betting/blob/ca91bb2c1c271b18bb027de7ee65115c0511862a/Flask/app.py

I just want to be clear, this is only a tool. With the addition of the lg data, this project is probably one of the most comprehensive models publicly available. It can be a great resource for learning about sports analytics and large-scale data, both historical and future. I recommend trying to understand how the project works: how we get data, how we process data, how models are trained, and so on. This is not a commercial product with a support team, just an open source tool. There is a lot of good information in the discussions; most common questions are answered over there.

@nkashyap14

Simply overlooked the part where you said it was only set up for fanduel. My apologies; in a rush to assist I wasted your time.

@nealmick (Author)

all good 👍

@rbulcher

After running this and comparing it to the model without -lg training and predicting, I don't see much change, if any, to the numbers / moneylines / over unders. In your experience, does training with lg enabled have any impact on the predictions, or is the last game data so insignificant overall that it doesn't impact outputs? Perhaps I am running something incorrectly? I train as explained in the main readme, then compare that to training the model with python3 -m XGBoost_Model_ML -lg and predicting with python3 main.py -xgb -odds=fanduel -lg.

@nealmick (Author) commented Nov 27, 2023

If there is literally no change, then it's probably running with the same model and something was done wrong.
If you mean there is little change, or you don't see an improvement in accuracy, that's a different claim.

To address the first point, I do think all of the data is being added properly. It gets the top 5 players based on minutes played, and uses those players' actual stats from their last game. I think the data is generally working. I definitely recommend checking my work and seeing if I missed something; these are large-scale data transformations, and it's not always easy to spot bugs.

To address the accuracy: I'm not claiming any level of accuracy. Last game data does correlate with game scores. Overall team-level stats tend to have the highest correlation because they represent a full team; individual player stats are more specific but offer more granular insights into the game. Last game data offers momentum-type information similar to streaks. The best way I've found to think of this is that there are different frequencies of data: you can have averages of the last 5 games, 10, 30, all the way up to season averages, and you can do that for both player- and team-level stats. From my research I have found there is something like a dead zone of data in between the short-term momentum data and the long-term seasonal data. I think momentum data like win/loss streaks correlates very highly with game scores, and so do season averages and other low-frequency long-term data. In between the short term and the long term, the data may offer little insight, or not correlate highly.

I also kind of get the feeling some of the model settings are causing overfitting. I don't have nearly as much experience with xgboost, or anything outside of TensorFlow, but it all kind of transfers. In general I have found that remarkably simple models can learn quite complex patterns. With machine learning I often find less is more with model parameters: the number of epochs, the number of games computed in a batch, the number of nodes, layers, and so on. Complex models struggle more with overfitting. Overfitting is when the model stops learning the essence of how the inputs relate to the game outcome and instead starts remembering specific results, acting more like a hard drive. Basically, if the model has enough weights to just remember all the results of the games it was trained on, it will, because that is an easier and faster way for the model to reduce its loss, which is the model's goal. This is called overfitting, and it is one of the core issues in machine learning. I have found that restricting the model's complexity to the point where it has to give up the memorization, because it can no longer reduce loss by remembering, forces it to generalize how the inputs tend to affect the outputs instead of remembering individual results.

I highly recommend turning down the model settings to the point where you can barely get it to train at all, then working up from there, because that simple model can't overfit, at least not as badly. A rough sketch of what that might look like is below.
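
For what it's worth, a hedged starting point for "turn it down first" in xgboost terms might look like this; the objective and every number here are illustrative, not the repo's actual settings, and dtrain/dtest are assumed DMatrix objects built from the training and hold-out frames:

```python
import xgboost as xgb

params = {
    'objective': 'binary:logistic',  # illustrative; match whatever the repo's trainer uses
    'max_depth': 2,                  # shallow trees can't memorize individual games
    'eta': 0.05,                     # small learning rate
    'subsample': 0.7,                # row subsampling fights memorization
    'colsample_bytree': 0.7,
    'lambda': 2.0,                   # L2 regularization
}
model = xgb.train(params, dtrain, num_boost_round=150,
                  evals=[(dtest, 'holdout')], early_stopping_rounds=20)
```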

Also, I will say we try to combat, or at least recognize, overfitting by using hold-out data sets. This may not be ideal for basketball, since the same teams play each other, and to some extent, depending on what data you're using, it's even possible for the inputs of two games to be basically the same but have different actual outcomes: to the model, same stats, different result. That may make the model perform, let's just say, differently on hold-out test data.

Also, there is the idea that basketball has a good bit of randomness, or uncaptured information. It is certainly possible that Vegas odds models have already accounted for most of the predictability in the game, and there is little benefit to additional analysis beyond that point. Say the game is 25% predictable and 75% random, and Vegas is already capturing that 25%, i.e. correct 25% more often than 50/50; they would have captured pretty much all of the predictability, then applied a handicap to the team, like -8 or whatever, and effectively yanked out any value there. This is a great idea to think about if you want to demotivate yourself and conclude that you have wasted 4 years trying to build models for nothing. The fact is, though, that Vegas odds are not predictive of game outcomes. Vegas does not care at all about game outcomes; they only care about keeping an even percentage of money on both sides of the action, that way they don't lose. The handicaps definitely make it more difficult. There are models that have beaten Vegas, and it's also not clear whether the lines are becoming more or less accurate over time. I think it's important to also understand that the game of basketball changes over time, and a single static model can't keep up forever; eventually the game moves on, even to the point where different stats matter more, like 3-point shots. I'll also just say, if you do end up beating Vegas, it's not a money printer. First off, it probably takes years to ever get to that point, but when you do, it's very obvious, and casinos will limit your play and ask you to leave. It's still possible to make some money, but not infinite. At that point it's probably best to sell model access and enable other people to make money. But you should recognize this isn't a normal market; it's a fixed market. You only get to play while you're losing. As soon as someone beats the market, it just disappears. Casinos can't exist while constantly losing, so it's kind of a balance.

@nealmick (Author) commented Nov 27, 2023

Just to add to this: "it doesnt impact outputs" really sounds like the same model. To be clear, when you train a model, the system does not automatically start using the new model. Additionally, even if you retrain on the same data you're gonna get different predictions. This is almost certainly the problem here.

To change the model you're using, that model must actually be trained with the lg data. The number of data features is counted and saved with the model, so when the model is loaded back up it detects how much data was used in training. You must use lg data in training and specify the -lg flag when predicting future games. The last part is to specify the new model in the runner. So look in the runner file and make sure the runner is using the new model. Simply running training does not make the system use a new model. This is the default behavior btw, same as without lg: if you trained a model, you must update the runner to use it, otherwise it just uses the default model. This is intended and good, I think, at least.

Okay, I'll explain it real simple like this:
Train a model:
python -m XGBoost_Model_ML -lg
Now look in the Models folder. There should be NN_Models and XGBoost_Models folders.
Those contain the default model checkpoints (saved weights).
Now, if you are following along and your training has started or completed, you should begin to see new files appearing in the Models folder. These files have names like this:
XGBoost_66.7%_ML-4.json

That is an example of a saved model checkpoint/weights file.
The way training currently works, the model only saves weights when accuracy improves. So over 10 epochs or whatever: first epoch saves, second epoch maybe no improvement so no save, third epoch accuracy goes up, save again. You end up with a trail of saved checkpoints. Pick one and copy the file path.

Now open up src/predict/XGBoost_runner.py.
You should see, in the first few lines, where the models for both OU and ML are specified. These are file paths that should show the defaults, like:
'Models/XGBoost_Models/XGBoost_53.7%_UO-9.json'

So now you need to replace that string with the path to your new weights. There is a catch here.

You must make the path like this: Models/XGBoost_blah blah yourfile here .json
Not like this: Models/XGBoost_Models/XGBoost_blah blah yourfile here .json
Notice the difference: custom models are saved one folder back.

Okay, that's pretty much it; you have the new model set to load, hopefully trained with lg data, and now it can actually load that model up and use it. Keep in mind it still detects the parameters, and you still need to specify -lg. These are safety measures to keep the system working if you mix and match models: if you train OU with lg but ML without lg, you can specify -lg and it will work fine, detecting which model uses lg and seamlessly adding the data as needed. A rough version of what the runner ends up looking like is sketched below.
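
In runner terms, the end state of those steps is roughly the sketch below; the lg file name and default OU path are the examples quoted above, while args.lg and base_feature_count are placeholder names, not the branch's exact variables:

```python
import xgboost as xgb

xgb_ml = xgb.Booster()
xgb_ml.load_model('Models/XGBoost_66.7%_ML-4.json')                  # lg-trained ML weights, one folder up from the defaults
xgb_uo = xgb.Booster()
xgb_uo.load_model('Models/XGBoost_Models/XGBoost_53.7%_UO-9.json')   # default OU weights left as-is

# the feature count saved at training time decides whether lg columns get appended for this model
ml_wants_lg = args.lg and int(xgb_ml.attr('num_features') or 0) > base_feature_count
```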

@rbulcher

Wow, thank you for all of that nealmick. I was in fact doing the lg predictions wrong, not using the correct weights. I also want to mention that I appreciate your post and explanation, and that despite Vegas odds being what they are, I enjoy your work here. I understand this isn't a one-stop solution to perfect bets, but the insight is at a minimum fun to enjoy, and at best a solid tool for trying to predict performance. So for that, and all of your and your team's hard work, thank you.

@Gxent commented Nov 28, 2023

@nealmick Can you explain to me how to set this up? I don't understand how I should do it?

@nealmick (Author)

Bruh, you really gotta try gpt. Please, go ask gpt something like: "i have a question about a github repo i want to install, someone made a fork and they have a branch with some new feature i want to test, how can i do this?" Notice how this has nothing to do with the project and could be about any repo.

User
i have a question about a github repo i want to install, someone made a fork and they have a branch with some new feature i want to test, how can i do this?

ChatGPT
To install and test a new feature from a forked branch of a GitHub repository, you can follow these steps:

Clone the Forked Repository: First, you need to clone the forked repository to your local machine. You can do this by finding the URL of the forked repository on GitHub. It usually looks something like https://github.com/[username]/[repo-name].git. Then, use the command:

```bash
git clone https://github.com/[username]/[repo-name].git
```
Replace [username] and [repo-name] with the actual username and repository name.

Switch to the Desired Branch: Once the repository is cloned, you need to switch to the branch that contains the new feature. You can do this using the git checkout command. For example:

```bash
git checkout [branch-name]
```
Replace [branch-name] with the name of the branch that has the new feature.

Install Dependencies: If the project has any dependencies, you should install them. This usually involves running a command like npm install or pip install -r requirements.txt, depending on the language and environment.

Run the Project: Depending on the nature of the project, you might need to build it or run it directly. This could be something like npm start, python main.py, or a similar command relevant to the project.

Test the New Feature: Now that you have the project running with the new feature, you can begin testing it. How you test it will depend on the nature of the feature and the project.

Remember to check the README.md or other documentation provided with the repository for any specific instructions related to setting up or running the project.

User
great!!! is there an easier way to do this with the -b flag??
ChatGPT
Yes, there is a more streamlined way to clone a specific branch from a forked repository using the -b flag with the git clone command. This allows you to clone the repository and directly check out the specific branch in one step. Here's how you do it:

Clone and Checkout the Branch: Use the following command to clone the forked repository and directly switch to the desired branch:

```bash
git clone -b [branch-name] https://github.com/[username]/[repo-name].git
```
Replace [branch-name] with the name of the branch you want to test, and [username] and [repo-name] with the username and repository name.

This command clones the repository and automatically checks out the specified branch.

Install Dependencies: As before, if the project has any dependencies, install them. This typically involves running a command like npm install, pip install -r requirements.txt, or a similar command relevant to the project's environment.

Run the Project: Depending on the project, you might need to build it or run it directly. This could involve commands like npm start, python main.py, or other relevant commands.

Test the New Feature: Now, you can test the new feature in the context of the project. The specific steps for testing will depend on the nature of the feature and the project.

This method is particularly useful when you want to quickly set up and test a specific branch without the need to manually check out the branch after cloning the repository.

@nealmick (Author) commented Dec 5, 2023

Added credits at the bottom of the readme, and a disclaimer.

@nealmick (Author) commented Dec 5, 2023

Added a note in the readme about updating the runner when training, and also added the -h command to the readme. The disclaimer should hopefully create some clarity around the intended use case and prevent any harm.

@nealmick (Author) commented Dec 5, 2023

@kyleskom just reaching out, any updates on the review?

@kyleskom (Owner) commented Dec 5, 2023

I haven't yet. I'm going to try to get to it this weekend. I apologize for the delay.

@nealmick (Author) commented Dec 6, 2023

no problem, sounds good

Repository owner deleted a comment Dec 25, 2023
@bklynate

+1 bump—I'd love to see this work merged or at least a conversation revived on this topic.

@kyleskom (Owner)

I'm slowly going through PRs right now; this one is fairly large, so it's going to take more time.

@bklynate

Agreed, 31 files changed in a single PR is not what you want lol. From skimming through the PR a lot of the changes are new data pickle files.

@nealmick (Author)

lol yea it's been a while; balldontlie.io updated and now requires an api key, and some endpoints changed, so the logic would need updating. I wouldn't worry about reviewing this for now, may try to fix it for next season tho.

@kyleskom (Owner)

Ok, makes sense. Sorry it took so long to get to this. I'll keep this open for now if you decide to revisit.

@stratty7

Hi @nealmick, interesting PR. I have run it locally, updated the balldontlie endpoints, and all appears to work correctly; it was fun to pick through how it works. I am curious about using pickle files as a cache/storage as opposed to sqlite, like the rest of the repo uses.

One quick question: how did you generate the 20xxGames.pkl files? It seems like that would be an important tool to have for future training if this PR gets merged (training against the latest player stats seems important as league trends change).

Regardless, thanks for sharing. The code and your write-ups here are a good read and very interesting 👍

@nealmick (Author)

Hey @stratty7, sounds good. I just use pickle for my own projects; that's how I originally stored the data, mainly because it's easy and all the data can be loaded into memory, and not even 1gb of memory is needed. All the data pickle files are generated using the balldontlie.io api, looping over all the data and saving it slowly; it took several days originally. I think there is a higher rate limit now if you pay for the api key, which I do and could share. It does need an update script to gather new data for new seasons. I did have a script I used to get the data originally and could probably modify it to update things. But first it may need updating to work for the new season with current data, though it may already work; I'm not certain, I haven't been able to test.

Also, I wanted to make a new cli flag to change the date so we can go into prediction mode out of season. However, when I looked into that, the other api's for getting odds may cause problems or at least need extra logic. It would be great to test and fix things before the season starts, but it's difficult with no games.

Appreciate the positive feedback. I could maybe convert the pickle data objects to sqlite tables and such, but in the end it all gets loaded back into memory before training. In a month or so the main season starts, and if I can get it working and cleaned up maybe I will try to get it merged again. But I also have no problem just leaving this as a fork and having it exist like this...

@nealmick (Author)

Okay, did some fixes and was able to get it working; today is the first day of the season. Really like the new flask app, looks fantastic.

@nealmick (Author)

(video attachment: output.mp4)

Added popup modal windows. This allows teams to be clicked, which opens a roster showing player status, and players can be clicked to open a player detail page showing a 10 game history and detailed injury status.

@bklynate

This is amazing work

@bklynate

```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/bklynate/dev_work/aiWorld/betting/NBA-Machine-Learning-Sports-Betting/src/Process-Data/Create_Games.py", line 90, in <module>
    home_team_series = team_df.iloc[team_index_current.get(home_team)]
                       ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bklynate/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pandas/core/indexing.py", line 1153, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bklynate/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pandas/core/indexing.py", line 1711, in _getitem_axis
    raise TypeError("Cannot index by location index with a non-integer key")
TypeError: Cannot index by location index with a non-integer key
```

I am using python 3.11 btw and I am getting this error when I run python -m Create_Games

@nealmick (Author)

I had not run that command before, but trying it now it seems to work for me. Also, it may not be necessary to create new games unless you want to update the training data knowledge cutoff. I have always run training with just the default db, without creating new games or updating. I'm not sure when the sqlite db was last updated, but it's probably fairly recent.

I think the issue is here:

home_team_series = team_df.iloc[team_index_current.get(home_team)]

No changes have been made there, so maybe @kyleskom can help. From looking at that file, it seems to be having trouble forming the data for the 2023-2024 season. When running Create_Games I did not run Get_Data or Get_Odds_Data first, which might be why it did not error for me. Either way, if you skip creating new games or gathering new data it should work fine. It will still get updated data for today's games; it just won't be able to train up to the current date, and training will be cut off at whenever the db was last updated. A hedged guess at the failure mode is sketched below.
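
If it helps with debugging, my guess (untested) is that team_index_current.get(home_team) returned None for a team name missing from the index, which .iloc then rejects as a non-integer key. A defensive sketch, not a verified fix, would be:

```python
# inside the loop over games in Create_Games.py
idx = team_index_current.get(home_team)
if idx is None:
    print('Skipping game, team not found in index:', home_team)
    continue
home_team_series = team_df.iloc[idx]
```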
