Releases: KarelZe/thesis
Changes between December 19th and December 25th
Took some time off for Christmas.
What's Changed
Empirical Study
- Add accuracy for rev. LR on test set by @KarelZe in #89
- Add compliance to `PDF/A-2b` by @KarelZe in #90
- Remove `TabNet` by @KarelZe in #91
- Improve GPU utilization by @KarelZe in #87
Writing
Outlook
- Will take some time off until New Year's Eve.
- Continue pre-writing of supervised approaches, i.e., transformers, ordered boosting, and #88.
- Finalize feature sets. Make sure to add economic intuition and remove redundant features and transformations. Include feedback from discussions with @CaroGrau and @pheusel.
- Continue work on #85. Necessary to get insights on feature definitions. The branch will include attention activations and SHAP.
- Continue work on #93. Having a common, `sklearn`-like interface is necessary for further aspects of training and evaluation, like calculating SHAP values, creating learning curves, or simplifying hyperparameter tuning.
- Try out `SLURM` to train models overnight or for periods longer than 4 hours. Jobs can run for up to 48 hours.
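The value of a common, `sklearn`-like interface can be sketched with a toy classifier implementing the usual `fit`/`predict`/`get_params`/`set_params` contract. This is a hypothetical minimal example, not the project's actual model wrapper:

```python
class MajorityClassifier:
    """Toy classifier exposing the sklearn-style estimator contract.

    Hypothetical sketch: the real models (CatBoost wrapper, transformers)
    would follow the same contract so that SHAP values, learning curves,
    and hyperparameter tuning can treat them uniformly.
    """

    def __init__(self, default: int = 0):
        self.default = default

    def get_params(self, deep: bool = True) -> dict:
        # sklearn convention: return constructor arguments.
        return {"default": self.default}

    def set_params(self, **params) -> "MajorityClassifier":
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y) -> "MajorityClassifier":
        # Learned attributes end in "_", per sklearn convention.
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        self.majority_ = max(counts, key=counts.get) if counts else self.default
        return self

    def predict(self, X):
        # Predict the majority class seen during fit for every sample.
        return [self.majority_ for _ in X]
```

Anything honoring this contract can be dropped into `sklearn` utilities such as `clone`, cross-validation, or permutation importance.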
Full Changelog: v0.2.6...v0.2.7
Changes between December 12th and December 18th
What's Changed
Empirical Study
- Add `clnv` results by @KarelZe in #82. Adds results for the CLNV method as discussed in the meeting with @CaroGrau.
- Add learning curves for CatBoost by @KarelZe in #83. Helps to detect overfitting/underfitting. Learning curves are now also logged/tracked.
- Improve accuracy (~1.2 %) by @KarelZe in #79. Most of the time was spent on improving the first model's accuracy (gbm). I had planned an improvement of 4 % and achieved 1.2 % compared to the previous week. Obtaining this improvement required a deep dive into gradient boosting, the CatBoost library, and a bit of feature engineering. Roughly one third of the improvement in accuracy comes from improved feature engineering, one third from early stopping, and one third from larger ensembles, fine-grained quantization, and sample weighting. I tried to link the quantization found in gradient boosting with the quantile transformation from feature engineering, but it didn't work out. Did some sanity checks like comparing the implementation with `lightgbm`, a time-consistency analysis, and an updated adversarial validation.
- Also spent quite a bit of time researching feature engineering techniques, focusing on features that cannot be synthesized by neural nets or tree-based approaches.
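Early stopping on a held-out validation split, one of the three accuracy levers above, can be sketched as follows. The thesis uses CatBoost, but scikit-learn's `GradientBoostingClassifier` illustrates the same mechanism; data and hyperparameters here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the trade data set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

clf = GradientBoostingClassifier(
    n_estimators=500,           # upper bound on ensemble size
    validation_fraction=0.2,    # held out internally for early stopping
    n_iter_no_change=10,        # stop after 10 rounds without improvement
    tol=1e-4,
    random_state=42,
)
clf.fit(X, y)

# clf.n_estimators_ reports how many trees were actually fit,
# which is typically well below the upper bound when stopping kicks in.
```

CatBoost exposes the same idea through its `early_stopping_rounds` fit parameter together with an evaluation set.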
Writing
- Add reworked TOC and drafts by @KarelZe in #80, as requested by @CaroGrau.
- Draft for chapters on trees, ordered boosting, and imputation by @KarelZe in #81. Continued research and drafting of the chapters on decision trees, gradient boosting, and feature scaling and imputation. Requires more work, e.g., the derivation of the loss function in gradient boosting for classification was more involved than I expected. The draft is not as streamlined as it could be.
Outlook
- Focus drafting on the chapters on gradient boosting, basic transformer architectures, and specialized architectures.
- Train transformers until the meeting with @CaroGrau, but spend no time optimizing/improving them.
Full Changelog: v0.2.5...v0.2.6
Changes between December 5th and December 11th
What's Changed
Empirical Study
- Add implementation and tests for `FTTransformer` by @KarelZe in #74. Adds a tuneable implementation of the `FTTransformer` from https://arxiv.org/abs/2106.11959. Most of the code is based on the author's code published by Yandex. Wrote additional tests and made the code work with our hyperparameter search.
- Add implementation and tests for `TabNet` by @KarelZe in #75. `TabNet` is another transformer-based architecture, published in https://arxiv.org/abs/1908.07442, and the last model to be implemented. Code is based on a popular PyTorch implementation. Made it work with our hyperparameter search and training pipeline and wrote additional tests.
- Add tests for all objectives by @KarelZe in #76. All training objectives defining the hyperparameter search space and training procedure now have tests.
- Add intermediate results of `TabTransformer` and `CatBoostClassifier` by @KarelZe in #71. Results as discussed in the last meeting with @CaroGrau.
- Accelerate models with `datapipes` and `torch.compile()` by @KarelZe in #64. Tested how the new features (`datapipes` and `torch.compile()`) could be used in my project. Still too early, as discussed in the meeting with @CaroGrau.
- Make calculations data parallel by @KarelZe in #77. All models can now be trained on multiple GPUs in parallel, which should speed up training considerably. BwHPC provides up to four GPUs that we can use. For gradient boosting, features are split among devices; for neural nets, batches are split.
- Add pruning support for Bayesian search by @KarelZe in #78. I added support to prune unsuccessful trials in our Bayesian search. This should help with training and finding better solutions faster. In addition to the loss, the accuracy is now also reported for all neural nets. Moreover, I integrated early stopping into the gradient boosting models, which should help to increase performance. Also widened the hyperparameter search space for gradient-boosted trees, which should help to find better solutions. Still have to verify with large studies on the cluster.
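The pruning idea behind #78 — stop trials whose intermediate results lag behind completed ones — can be sketched independent of any library. This is a simplified re-implementation of the median-pruning strategy popularized by optuna's `MedianPruner`, not the project's code:

```python
import statistics


class SimpleMedianPruner:
    """Simplified sketch of median pruning for hyperparameter search.

    Trials report intermediate values (here: "higher is better", e.g.,
    validation accuracy) per step; a running trial is pruned when it
    falls below the median of values seen at the same step.
    """

    def __init__(self, warmup_trials: int = 2):
        self.warmup_trials = warmup_trials
        self.history = {}  # step -> intermediate values of finished trials

    def report(self, step: int, value: float) -> None:
        # Record the intermediate value of a completed trial at this step.
        self.history.setdefault(step, []).append(value)

    def should_prune(self, step: int, value: float) -> bool:
        completed = self.history.get(step, [])
        if len(completed) < self.warmup_trials:
            return False  # not enough evidence to prune yet
        # Prune if the running trial is worse than the median trial so far.
        return value < statistics.median(completed)
```

In the real search, the equivalent check runs inside the training loop so hopeless trials free up their GPU budget early.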
Writing
- Add questions for this week by @KarelZe in #70
- Connect and expand notes by @KarelZe in #65. Was able to slightly decrease the pile of papers. However, I also found several new ones, like the Linformer paper (https://arxiv.org/abs/2006.04768).
Other Changes
- Bump google-auth from 2.14.1 to 2.15.0 by @dependabot in #66
- Bump fastparquet from 2022.11.0 to 2022.12.0 by @dependabot in #69
Outlook
- Finalize notes on decision trees / gradient boosting. Prepare the first draft.
- Update table of contents.
- Go back to EDA. Define new features based on papers. Revise existing ones based on KDE plots.
- Create a notebook to systematically study feature transformations/scaling, e.g., log transform or robust scaling.
- Study learning curves for gradient boosting models and transformers with default configurations. Verify the settings for early stopping.
- Perform adversarial validation more thoroughly. Answer questions like: which features drive the difference between the training and test set? What role does time play? What would happen if problematic features were excluded?
- Increase test accuracy by 4 %.
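Adversarial validation, as referenced above, boils down to checking whether a classifier can distinguish training rows from test rows. A minimal sketch under assumed names (the function and hyperparameters are illustrative, not the notebook's code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def adversarial_validation(X_train, X_test, seed=42):
    """Label train rows 0 and test rows 1, then measure how well a
    classifier separates them. A cross-validated AUC near 0.5 means the
    sets are indistinguishable; values near 1.0 signal drift."""
    X = np.vstack([X_train, X_test])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    return cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()
```

Inspecting the fitted classifier's feature importances then answers which features drive the train/test difference.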
Full Changelog: v0.2.4...v0.2.5
Changes between November 28th and December 4th
What's Changed
Empirical Study
- Exploratory data analysis by @KarelZe in #40, #50, and #51. Refactored the EDA to the training set only, included new features, and added new visualizations. Performed CV to study the results of a gradient-boosted model on different feature sets. Overall, the results look promising, as the untuned gradient-boosted tree outperforms the respective classical counterparts on all three subsets. Performance is relatively stable across folds and the validation and test set. Log transform and imputation make no difference for gradient-boosted trees, as expected. Problems like the high cardinality of some categorical values still need to be addressed. I also feel like more features or other transformations could help. Will do more research in the classical literature on this.
- Add model tracking and saving of the `optuna.Study` to `wandb` through a callback by @KarelZe in #55 and #63. All major data (data sets, models, studies, and some diagrams) is now tracked in `wandb` and saved to `gcs`. The use of callbacks makes the implementation of other parts, like learning rate schedulers or stochastic weight averaging, much easier.
- Experimented with further accelerating the `TabTransformer` through `PyTorch 2.0` and `datapipes` in #64 (WIP) by @KarelZe. This is still in progress, as I wasn't able to compile the model yet. Got a vague idea of possible solutions, e.g., a stripped-down implementation or an upgrade of CUDA. Also, waiting could help, as `PyTorch 2.0` is in early beta and was only announced at the weekend. Didn't look into `datapipes` yet, which could help close the gap in serving the GPU enough data. Also did some research on how high-cardinality features like `ROOT` can be handled in the model to avoid an explosion in parameters. The latter is necessary to train the model with reasonable performance.
- Increased test coverage to 83 %, or 45 tests in total. Writing these tests helped me discover some minor bugs, e.g., in the depth rule, that I had previously overlooked. Tests were added for:
  - Add new features to train, validation, and test set by @KarelZe in #51
  - Refactor redundant code in `ClassicalClassifier` by @KarelZe in #52 and #53
  - Remove RDB support by @KarelZe in #54
Writing
- Added new notes and connected existing ideas. I was able to reduce the pile of unread papers to 30. #65 (WIP) by @KarelZe.
Other Changes
- Add `dependabot.yml` for dependency bot by @KarelZe in #56
- Bump schneegans/dynamic-badges-action from 1.4.0 to 1.6.0 by @dependabot in #57
- Bump typer from 0.6.1 to 0.7.0 by @dependabot in #62
- Bump fastparquet from 0.8.3 to 2022.11.0 by @dependabot in #60
Outlook
- Pre-write first chapter of thesis
- Set up plan for writing
- Read 10 papers and add to zettelkasten
- Turn the EDA notebook into scripts for feature generation and document with tests
- Train TabTransformer and gradient boosted model until meeting with @CaroGrau
- Further improve training performance of the TabTransformer. Try out `datapipes` and adjust the implementation to be closer to the paper. Decrease cardinality through NLP techniques
New Contributors
- @dependabot made their first contribution in #57
Full Changelog: v0.2.3...v0.2.4
Changes between November 21st and November 27th
What's Changed
Empirical Study
- Add TabTransformer baseline by @KarelZe in #34. Involved implementation and documentation of the model, early stopping, data set, and data loader. Most notably, I was able to speed up the implementation of https://github.com/kathrinse/TabSurvey/ by a factor of 9.8 (see notebook) through an improved data loader, decoupling of training and data loading, and mixed-precision support. Also tested were fused operations, pre-loading, and the use of pinned memory. An analysis with the PyTorch profiler reveals that the GPU is now less idle. Training on the entire data set is theoretically possible.
- Fix classical rules by @KarelZe in #41. The issue came up during last week's discussion with @CaroGrau. The differences in accuracy are tiny, usually < 1 %.
- Add test cases for classical classifier by @KarelZe in #42. Tests are formal, e.g., correct shapes of predictions or fitting behaviour.
- Add implementation of `CLNV` method by @KarelZe in #43
- Add tests for TabTransformer by @KarelZe in #44. Tests for shapes of predictions, parameter updates, and convergence.
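The data-loader speed-up above rests on decoupling batch preparation from the training loop so the GPU is never waiting on the next batch. The core idea can be sketched with a background-thread prefetcher; an illustrative stdlib sketch, not the repo's actual PyTorch data loader:

```python
import queue
import threading


def prefetch(iterable, buffer_size: int = 4):
    """Yield items from `iterable` while a background thread keeps a
    small buffer filled, so producing the next batch overlaps with
    consuming the current one (the decoupling idea behind the speed-up)."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for item in iterable:
            q.put(item)  # blocks when the buffer is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item
```

PyTorch's `DataLoader` achieves the same overlap with worker processes plus pinned memory; the sketch just isolates the principle.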
Writing
- Add questions for this week's meeting by @KarelZe in #39
- Researched techniques and new papers on speeding up transformers
Outlook
- Read more again and minimize the stack of open papers (40+)
- Better connect existing ideas in zettelkasten
- Finish exploratory data analysis, i.e., include new features, refactor to training data only, and do CV to better understand features
- Improve test coverage, i.e., data loader and classical rules
Full Changelog: v0.2.2...v0.2.3
Changes between November 14th and November 20th
What's Changed
Empirical Study
- Add tuneable implementation of GBM and classical rules by @KarelZe in #35 and #27. Models can now be trained using parametrized scripts, i.e., `python src/models/train_model.py --trials=5 --seed=42 --model=gbm --dataset=fbv/thesis/train_val_test_w_trade_size:v0 --features=ml`. Data is loaded from versioned artefacts. Interrupted studies can now be continued at a later point in time. Added some tests for the search. Bayesian search is implemented for gradient-boosted trees and the TabTransformer, but also for the classical rules. Thereby, I was able to find combinations of classical rules previously not reported in the @CaroGrau paper.
- Added TabTransformer implementation by @KarelZe in #34 (WIP). The current implementation performs binary classification and can be tuned using Bayesian search. Issues to resolve: improve utilization of the accelerator, speed up training, and improve code quality. Plan to address these with PyTorch profiling, the CUDA profiler, a custom PyTorch `Dataset`, and a bit of luck.
- Add basic Docker support by @KarelZe in #28. Docker image now available on Docker Hub.
- Add compliance to `pre-commit` hooks by @KarelZe in #33. Pre-commit hooks help to avoid potential bugs. Code in `main` is now fully documented and annotated with type hints.
- Simplified project and test setup by @KarelZe in #38. This greatly improves reproducibility and ease of development.
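The parametrized entry point can be sketched with `argparse`. Flag names follow the example invocation above; the defaults and model choices here are illustrative assumptions, not the actual `train_model.py`:

```python
import argparse


def parse_args(argv=None):
    """Sketch of a parametrized training entry point. Flags mirror the
    example call; defaults and choices are illustrative."""
    parser = argparse.ArgumentParser(description="Run a hyperparameter study.")
    parser.add_argument("--trials", type=int, default=5,
                        help="number of Bayesian-search trials")
    parser.add_argument("--seed", type=int, default=42,
                        help="seed for reproducible studies")
    parser.add_argument("--model", default="gbm",
                        choices=["gbm", "tabtransformer", "classical"])
    parser.add_argument("--dataset",
                        default="fbv/thesis/train_val_test_w_trade_size:v0",
                        help="versioned artefact to load")
    parser.add_argument("--features", default="ml",
                        help="name of the feature set")
    return parser.parse_args(argv)
```

Because every run is fully described by its flags plus a versioned data artefact, interrupted studies can be resumed by re-invoking the script with the same arguments.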
Writing
- Add notes to zettelkasten by @KarelZe in #29
- Add proposal for feature sets by @KarelZe in #31
- Simplified and extended `readme.md` by @KarelZe in #36
- Finalized exposé with numbers by @KarelZe in #37
Outlook
- Continue with exploratory data analysis and start with explanatory data analysis
- Analyze low resource utilization and slow training of TabTransformer
- Read more again and minimize the stack of open papers (30+)
- Better connect existing ideas in zettelkasten
- Improve test coverage
Full Changelog: v0.2.1...v0.2.2
Changes between November 7th and November 13th
What's Changed
Empirical Study
- Added all classical rules from the @CaroGrau paper. Any rule can now be stacked together in an arbitrary order, e.g., `predict_rules(layers=[(trade_size,"ex"), (quote,"best"), (quote, "ex")], name="Tradesize + Quote (NBBO) + Quote (ISE)")`. Minor differences in accuracy still exist due to a different handling of missing values. By @KarelZe in #27 (WIP)
- Started with an exploratory data analysis by @KarelZe in #25 (WIP)
- Added proposal for feature set definition by @KarelZe in #29 (WIP)
- Created a Docker image to run code on bwUniCluster 2.0 and runpod by @KarelZe in #28 (WIP)
- Did all the setup to connect to bwUniCluster 2.0
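Stacking classical rules follows a simple fallback pattern: each rule either classifies the trade or defers to the next layer. A hypothetical sketch — the rule logic, field names, and function names are illustrative, not the thesis implementation behind `predict_rules`:

```python
def predict_stacked(rules, trade):
    """Apply trade-classification rules in order; the first rule that
    yields a decision wins, later layers only handle trades the earlier
    ones could not classify. Each rule returns 1 (buy), -1 (sell),
    or None (defer)."""
    for rule in rules:
        label = rule(trade)
        if label is not None:
            return label
    return None  # no rule in the stack could classify the trade


# Illustrative example rules (simplified, not the thesis versions):
def quote_rule(trade):
    # Classify relative to the quote midpoint; defer on midpoint trades.
    mid = (trade["bid"] + trade["ask"]) / 2
    if trade["price"] > mid:
        return 1
    if trade["price"] < mid:
        return -1
    return None


def tick_rule(trade):
    # Classify by comparison with the previous trade price.
    if trade["price"] > trade["prev_price"]:
        return 1
    if trade["price"] < trade["prev_price"]:
        return -1
    return None
```

With this structure, any ordering like "Tradesize + Quote (NBBO) + Quote (ISE)" is just a different list of layers, which is what makes the Bayesian search over rule combinations possible.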
Writing
Outlook for Upcoming Week
- implement a transformer-based baseline
- rework hyperparameter searches to be interruptible
- continue with exploratory data analysis and start with explanatory data analysis
- add more notes to zettelkasten
- better connect existing ideas in zettelkasten
- finish WIPs
Full Changelog: v0.2.0...v0.2.1
Changes between October 31st and November 6th
What's Changed
Empirical Study
- Improved adversarial validation and memory-constrained CSV loading by @KarelZe in #17, #18 and #24
- Implementation of (promising) GBM baseline, Bayesian search, some classical rules and robustness checks by @KarelZe in #19
- Experimented with training models on `runpods.io` to mitigate severe performance issues with `Google Colab`
Writing
- Restructured readme by @KarelZe in #21 and #22
- Added 15+ literature notes to zettelkasten by @KarelZe in #23 and #20
- Researched additional 30+ papers to read in the next week
Outlook for Upcoming Week
- start with explanatory data analysis
- investigate differences in the accuracy of classical rules with regard to @CaroGrau paper
- start implementing a transformer-based baseline
- add more notes to zettelkasten
- bundle training scripts in docker container
Full Changelog: v0.1.9...v0.2.0
Changes between October 27th and October 30th
What's Changed
Empirical Study
- Set up Google Cloud Storage and Google Colab.
- Loaded CSV data into a pandas data frame, inferred dtypes, performed optimizations, and exported into `.parquet` chunks by @KarelZe in #12.
- Added data set versioning using Weights & Biases.
- Created sub-samples, e.g., 2015, and train, validation, and test sets.
- Cleaned up requirements / fixed versions in `requirements.txt`.
- Created tests / assertions against the @CaroGrau paper in `2.0-mb-data_preprocessing_loading_splitting.ipynb`.
- Ran adversarial validation in `2.0-mb-data_preprocessing_loading_splitting.ipynb`.
Writing
- Added more notes to Zettelkasten, e.g., 6f704ff.
Other Changes
Full Changelog: v0.1.6...v0.1.9