In this repository, I attempt to predict stock prices using the Stock Fundamentals data set (http://www.usfundamentals.com/). I chose this data set because it is easily accessible while also being relatively unclean (e.g. it has missing values), which lets me develop and exercise my data science skills while applying what I have learned about machine learning so far.
This is my first end-to-end machine learning personal project.
I have opted to not use Jupyter notebooks due to personal preference, but it is a tool I would like to explore in future personal projects.
Scripts (in order of production):
- exploratory_data_analysis.py - explores a simple part of the data set, the latest quarterly snapshot (latest-snapshot-quarterly.csv), to get a feel for the data and how to process and analyse it. quarterly_snapshot_feature_importance.png and quarterly_snapshot_hist.png are two plots generated by this script.
- predict_stock_per_company.py - uses the quarterly information for each company to predict that company's current stock value.
- predict_stock_per_indicator.py - uses quarterly information for each indicator to predict current stock value from information about that indicator.
- predict_stock_from_indicators.py (the most mature prediction model) - uses quarterly information from 10+ common indicators to predict stock value, with the average EarningsPerShareDiluted across all quarters as the target. It reaches an acceptable level of correlation (test R² = 0.669), but the error remains large, likely due to outliers (there are some very large positive earnings values). Companies that were extreme outliers on the target were removed because of how strongly they skewed predictions. As a result, the model is likely weak at predicting the rare very positive or very negative stock values, but performs better across the wider range of more modest values. XGBoost was chosen to model the data because ensemble methods are robust to outliers, and gradient-boosted trees overfit less and offer more hyper-parameter control than random forests, while still being quite fast.
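The exploratory step in exploratory_data_analysis.py can be sketched roughly as follows. This is a minimal, self-contained illustration using synthetic data in place of latest-snapshot-quarterly.csv; the column names, the median imputation, and the RandomForestRegressor importance check are assumptions for illustration, not the script's exact contents (the real script also saves the histogram and feature-importance plots mentioned above).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for latest-snapshot-quarterly.csv (illustrative columns).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Assets": rng.lognormal(10, 1, 300),
    "Liabilities": rng.lognormal(9, 1, 300),
    "Revenues": rng.lognormal(9, 1, 300),
})
# Simulate the missing values present in the real data.
df.loc[rng.choice(300, 30, replace=False), "Revenues"] = np.nan
df["EarningsPerShareDiluted"] = np.log(df["Assets"]) + rng.normal(0, 0.5, 300)

# First look: how much is missing, and what do the distributions look like?
print(df.isna().sum())
print(df.describe())

# Rough feature-importance check after median imputation.
filled = df.fillna(df.median())
X = filled.drop(columns="EarningsPerShareDiluted")
y = filled["EarningsPerShareDiluted"]
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, imp in zip(X.columns, forest.feature_importances_):
    print(f"{name}: {imp:.3f}")
```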
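The outlier handling and modelling in predict_stock_from_indicators.py could look roughly like the sketch below. It uses synthetic data, and scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost so the example stays self-contained; the three-standard-deviation cutoff and all column names are illustrative assumptions, not the script's actual choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def remove_target_outliers(frame, target, k=3.0):
    """Drop rows whose target lies more than k standard deviations from the mean."""
    z = (frame[target] - frame[target].mean()) / frame[target].std()
    return frame[z.abs() <= k]

# Synthetic stand-in for the quarterly indicator table (illustrative columns).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Assets": rng.lognormal(10, 1, 500),
    "Revenues": rng.lognormal(9, 1, 500),
    "NetIncomeLoss": rng.normal(0, 1, 500),
})
data["EarningsPerShareDiluted"] = 0.5 * data["NetIncomeLoss"] + rng.normal(0, 0.1, 500)

# Remove companies whose target is an extreme outlier, as described above.
data = remove_target_outliers(data, "EarningsPerShareDiluted")

X = data.drop(columns="EarningsPerShareDiluted")
y = data["EarningsPerShareDiluted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Gradient-boosted trees (the project itself uses XGBoost's XGBRegressor).
model = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.1,
                                  random_state=0)
model.fit(X_train, y_train)
print(f"test R2 = {r2_score(y_test, model.predict(X_test)):.3f}")
```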