Releases: KarelZe/thesis
Changes between December 19th and December 25th
Took some time off for Christmas.
What's Changed
Empirical Study
- Add accuracy for rev. LR on test set by @KarelZe in #89
- Add compliance to `PDF/A-2b` by @KarelZe in #90
- Remove `TabNet` by @KarelZe in #91
- Improve GPU utilization by @KarelZe in #87
Writing
Outlook
- Will take some time off until New Year's Eve.
- Continue pre-writing of supervised approaches, i.e., transformers, ordered boosting, and #88.
- Finalize feature sets. Make sure to add economic intuition and remove redundant features and transformations. Include feedback from discussions with @CaroGrau and @pheusel.
- Continue work on #85. Necessary to get insights on feature definitions. The branch will include attention activations and SHAP.
- Continue work on #93. Having a common, `sklearn`-like interface is necessary for further aspects of training and evaluation, like calculating SHAP values, creating learning curves, or simplifying hyperparameter tuning.
- Try out `SLURM` to train models overnight or for periods longer than 4 hours. Jobs can run for up to 48 hours.
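The value of a common, `sklearn`-like interface can be sketched with a toy classifier implementing the usual `fit`/`predict`/`get_params`/`set_params` contract. This is a hypothetical minimal example, not the project's actual model wrapper:

```python
class MajorityClassifier:
    """Toy classifier exposing the sklearn-style estimator contract.

    Hypothetical sketch: the real models (CatBoost wrapper, transformers)
    would follow the same contract so that SHAP values, learning curves,
    and hyperparameter tuning can treat them uniformly.
    """

    def __init__(self, default: int = 0):
        self.default = default

    def get_params(self, deep: bool = True) -> dict:
        # sklearn convention: return constructor arguments.
        return {"default": self.default}

    def set_params(self, **params) -> "MajorityClassifier":
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y) -> "MajorityClassifier":
        # Learned attributes end in "_", per sklearn convention.
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        self.majority_ = max(counts, key=counts.get) if counts else self.default
        return self

    def predict(self, X):
        # Predict the majority class seen during fit for every sample.
        return [self.majority_ for _ in X]
```

Anything honoring this contract can be dropped into `sklearn` utilities such as `clone`, cross-validation, or permutation importance.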
Full Changelog: v0.2.6...v0.2.7
Changes between December 12th and December 18th
What's Changed
Empirical Study
- Add `clnv` results by @KarelZe in #82. Adds results for the CLNV method as discussed in the meeting with @CaroGrau.
- Add learning curves for CatBoost by @KarelZe in #83. Helps to detect overfitting/underfitting. Learning curves are now also logged/tracked.
- Improve accuracy (~1.2 %) by @KarelZe in #79. Most of the time was spent on improving the first model's accuracy (gbm). I had planned an improvement of 4 % and achieved 1.2 % compared to the previous week. Obtaining this improvement required a deep dive into gradient boosting, the CatBoost library, and a bit of feature engineering. Roughly one third of the improvement in accuracy comes from improved feature engineering, one third from early stopping, and one third from larger ensembles, fine-grained quantization, and sample weighting. I tried to link the quantization found in gradient boosting with the quantile transformation from feature engineering, but it didn't work out. Did some sanity checks like comparing the implementation with `lightgbm`, a time-consistency analysis, and an updated adversarial validation.
- Also spent quite a bit of time researching feature engineering techniques, focusing on features that cannot be synthesized by neural nets or tree-based approaches.
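Early stopping on a held-out validation split, one of the three accuracy levers above, can be sketched as follows. The thesis uses CatBoost, but scikit-learn's `GradientBoostingClassifier` illustrates the same mechanism; data and hyperparameters here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the trade data set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

clf = GradientBoostingClassifier(
    n_estimators=500,           # upper bound on ensemble size
    validation_fraction=0.2,    # held out internally for early stopping
    n_iter_no_change=10,        # stop after 10 rounds without improvement
    tol=1e-4,
    random_state=42,
)
clf.fit(X, y)

# clf.n_estimators_ reports how many trees were actually fit,
# which is typically well below the upper bound when stopping kicks in.
```

CatBoost exposes the same idea through its `early_stopping_rounds` fit parameter together with an evaluation set.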
Writing
- Add reworked TOC and drafts by @KarelZe in #80, as requested by @CaroGrau.
- Draft for chapters on trees, ordered boosting, and imputation by @KarelZe in #81. Continued research and drafting of the chapters on decision trees, gradient boosting, and feature scaling and imputation. Requires more work, e.g., the derivation of the loss function in gradient boosting for classification was more involved than I expected. The draft is not as streamlined as it could be.
Outlook
- Focus drafting on the chapters on gradient boosting, basic transformer architectures, and specialized architectures.
- Train transformers until the meeting with @CaroGrau, but spend no time optimizing/improving them.
Full Changelog: v0.2.5...v0.2.6
Changes between December 5th and December 11th
What's Changed
Empirical Study
- Add implementation and tests for `FTTransformer` by @KarelZe in #74. Adds a tuneable implementation of the `FTTransformer` from https://arxiv.org/abs/2106.11959. Most of the code is based on the author's code published by Yandex. Wrote additional tests and made the code work with our hyperparameter search.
- Add implementation and tests for `TabNet` by @KarelZe in #75. `TabNet` is another transformer-based architecture, published in https://arxiv.org/abs/1908.07442, and the last model to be implemented. Code is based on a popular PyTorch implementation. Made it work with our hyperparameter search and training pipeline and wrote additional tests.
- Add tests for all objectives by @KarelZe in #76. All training objectives defining the hyperparameter search space and training procedure now have tests.
- Add intermediate results of `TabTransformer` and `CatBoostClassifier` by @KarelZe in #71. Results as discussed in the last meeting with @CaroGrau.
- Accelerate models with `datapipes` and `torch.compile()` by @KarelZe in #64. Tested how the new features (`datapipes` and `torch.compile()`) could be used in my project. Still too early, as discussed in the meeting with @CaroGrau.
- Make calculations data parallel by @KarelZe in #77. All models can now be trained on multiple GPUs in parallel, which should speed up training considerably. BwHPC provides up to four GPUs that we can use. For gradient boosting, features are split among devices; for neural nets, batches are split.
- Add pruning support for Bayesian search by @KarelZe in #78. I added support to prune unsuccessful trials in our Bayesian search. This should help with training and finding better solutions faster. In addition to the loss, the accuracy is now also reported for all neural nets. Moreover, I integrated early stopping into the gradient boosting models, which should help to increase performance. Also widened the hyperparameter search space for gradient-boosted trees, which should help to find better solutions. Still have to verify with large studies on the cluster.
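The pruning idea behind #78 — stop trials whose intermediate results lag behind completed ones — can be sketched independent of any library. This is a simplified re-implementation of the median-pruning strategy popularized by optuna's `MedianPruner`, not the project's code:

```python
import statistics


class SimpleMedianPruner:
    """Simplified sketch of median pruning for hyperparameter search.

    Trials report intermediate values (here: "higher is better", e.g.,
    validation accuracy) per step; a running trial is pruned when it
    falls below the median of values seen at the same step.
    """

    def __init__(self, warmup_trials: int = 2):
        self.warmup_trials = warmup_trials
        self.history = {}  # step -> intermediate values of finished trials

    def report(self, step: int, value: float) -> None:
        # Record the intermediate value of a completed trial at this step.
        self.history.setdefault(step, []).append(value)

    def should_prune(self, step: int, value: float) -> bool:
        completed = self.history.get(step, [])
        if len(completed) < self.warmup_trials:
            return False  # not enough evidence to prune yet
        # Prune if the running trial is worse than the median trial so far.
        return value < statistics.median(completed)
```

In the real search, the equivalent check runs inside the training loop so hopeless trials free up their GPU budget early.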
Writing
- Add questions for this week by @KarelZe in #70
- Connect and expand notes by @KarelZe in #65. Was able to slightly decrease the pile of papers. However, I also found several new ones, like the Linformer paper (https://arxiv.org/abs/2006.04768).
Other Changes
- Bump google-auth from 2.14.1 to 2.15.0 by @dependabot in #66
- Bump fastparquet from 2022.11.0 to 2022.12.0 by @dependabot in #69
Outlook
- Finalize notes on decision trees / gradient boosting. Prepare the first draft.
- Update table of contents.
- Go back to EDA. Define new features based on papers. Revise existing ones based on KDE plots.
- Create a notebook to systematically study feature transformations/scaling, e.g., log transform or robust scaling.
- Study learning curves for gradient boosting models and transformers with default configurations. Verify the settings for early stopping.
- Perform adversarial validation more thoroughly. Answer questions like: which features drive the difference between the training and test set? What role does time play? What would happen if problematic features were excluded?
- Increase test accuracy by 4 %.
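Adversarial validation, as referenced above, boils down to checking whether a classifier can distinguish training rows from test rows. A minimal sketch under assumed names (the function and hyperparameters are illustrative, not the notebook's code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def adversarial_validation(X_train, X_test, seed=42):
    """Label train rows 0 and test rows 1, then measure how well a
    classifier separates them. A cross-validated AUC near 0.5 means the
    sets are indistinguishable; values near 1.0 signal drift."""
    X = np.vstack([X_train, X_test])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    return cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()
```

Inspecting the fitted classifier's feature importances then answers which features drive the train/test difference.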
Full Changelog: v0.2.4...v0.2.5
Changes between November 28th and December 4th
What's Changed
Empirical Study
- Exploratory data analysis by @KarelZe in #40, #50, and #51. Refactored the EDA to the training set only, included new features, and added new visualizations. Performed CV to study the results of a gradient-boosted model on different feature sets. Overall, the results look promising, as the untuned gradient-boosted tree outperforms the respective classical counterparts on all three subsets. Performance is relatively stable across folds and the validation and test set. Log transform and imputation make no difference for gradient-boosted trees, as expected. Problems like the high cardinality of some categorical values still need to be addressed. I also feel like more features or other transformations could help. Will do more research in the classical literature on this.
- Add model tracking and saving of the `optuna.Study` to `wandb` through a callback by @KarelZe in #55 and #63. All major data (data sets, models, studies, and some diagrams) is now tracked in `wandb` and saved to `gcs`. The use of callbacks makes the implementation of other parts, like learning rate schedulers or stochastic weight averaging, much easier.
- Experimented with further accelerating the `TabTransformer` through `PyTorch 2.0` and `datapipes` in #64 (WIP) by @KarelZe. This is still in progress, as I wasn't able to compile the model yet. Got a vague idea of possible solutions, e.g., a stripped-down implementation or an upgrade of CUDA. Also, waiting could help, as `PyTorch 2.0` is in early beta and was only announced at the weekend. Didn't look into `datapipes` yet, which could help close the gap in serving the GPU enough data. Also did some research on how high-cardinality features like `ROOT` can be handled in the model to avoid an explosion in parameters. The latter is necessary to train the model with reasonable performance.
- Increased test coverage to 83 %, or 45 tests in total. Writing these tests helped me discover some minor bugs, e.g., in the depth rule, that I had previously overlooked. Tests were added for:
  - Add new features to train, validation, and test set by @KarelZe in #51
  - Refactor redundant code in `ClassicalClassifier` by @KarelZe in #52 and #53
  - Remove RDB support by @KarelZe in #54
Writing
- Added new notes and connected existing ideas. I was able to reduce the pile of unread papers to 30. #65 (WIP) by @KarelZe.
Other Changes
- Add `dependabot.yml` for dependency bot by @KarelZe in #56
- Bump schneegans/dynamic-badges-action from 1.4.0 to 1.6.0 by @dependabot in #57
- Bump typer from 0.6.1 to 0.7.0 by @dependabot in #62
- Bump fastparquet from 0.8.3 to 2022.11.0 by @dependabot in #60
Outlook
- Pre-write first chapter of thesis
- Set up plan for writing
- Read 10 papers and add to zettelkasten
- Turn the EDA notebook into scripts for feature generation and document with tests
- Train TabTransformer and gradient boosted model until meeting with @CaroGrau
- Further improve training performance of the TabTransformer. Try out `datapipes` and adjust the implementation to be closer to the paper. Decrease cardinality through NLP techniques
New Contributors
- @dependabot made their first contribution in #57
Full Changelog: v0.2.3...v0.2.4
Changes between November 21st and November 27th
What's Changed
Empirical Study
- Add TabTransformer baseline by @KarelZe in #34. Involved implementation and documentation of the model, early stopping, data set, and data loader. Most notably, I was able to speed up the implementation of https://github.com/kathrinse/TabSurvey/ by a factor of 9.8 (see notebook) through an improved data loader, decoupling of training and data loading, and mixed-precision support. Also tested were fused operations, pre-loading, and the use of pinned memory. An analysis with the PyTorch profiler reveals that the GPU is now less idle. Training on the entire data set is theoretically possible.
- Fix classical rules by @KarelZe in #41. The issue came up during last week's discussion with @CaroGrau. The differences in accuracy are tiny, usually < 1 %.
- Add test cases for classical classifier by @KarelZe in #42. Tests are formal, e.g., correct shapes of predictions or fitting behaviour.
- Add implementation of `CLNV` method by @KarelZe in #43
- Add tests for TabTransformer by @KarelZe in #44. Tests for shapes of predictions, parameter updates, and convergence.
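The data-loader speed-up above rests on decoupling batch preparation from the training loop so the GPU is never waiting on the next batch. The core idea can be sketched with a background-thread prefetcher; an illustrative stdlib sketch, not the repo's actual PyTorch data loader:

```python
import queue
import threading


def prefetch(iterable, buffer_size: int = 4):
    """Yield items from `iterable` while a background thread keeps a
    small buffer filled, so producing the next batch overlaps with
    consuming the current one (the decoupling idea behind the speed-up)."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for item in iterable:
            q.put(item)  # blocks when the buffer is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item
```

PyTorch's `DataLoader` achieves the same overlap with worker processes plus pinned memory; the sketch just isolates the principle.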
Writing
- Add questions for this week's meeting by @KarelZe in #39
- Researched techniques and new papers on speeding up transformers
Outlook
- Read more again and minimize the stack of open papers (40+)
- Better connect existing ideas in zettelkasten
- Finish exploratory data analysis, i.e., include new features, refactor to training data only, and do CV to better understand features
- Improve test coverage, i.e., data loader and classical rules
Full Changelog: v0.2.2...v0.2.3
Changes between November 14th and November 20th
What's Changed
Empirical Study
- Add tuneable implementation of GBM and classical rules by @KarelZe in #35 and #27. Models can now be trained using parametrized scripts, i.e., `python src/models/train_model.py --trials=5 --seed=42 --model=gbm --dataset=fbv/thesis/train_val_test_w_trade_size:v0 --features=ml`. Data is loaded from versioned artefacts. Interrupted studies can now be continued at a later point in time. Added some tests for the search. Bayesian search is implemented for gradient-boosted trees and the TabTransformer, but also for the classical rules. Thereby, I was able to find combinations of classical rules previously not reported in the @CaroGrau paper.
- Added TabTransformer implementation by @KarelZe in #34 (WIP). The current implementation performs binary classification and can be tuned using Bayesian search. Issues to resolve: improve utilization of the accelerator, speed up training, and improve code quality. Plan to address these with PyTorch profiling, the CUDA profiler, a custom PyTorch `Dataset`, and a bit of luck.
- Add basic Docker support by @KarelZe in #28. Docker image now available on Docker Hub.
- Add compliance to `pre-commit` hooks by @KarelZe in #33. Pre-commit hooks help to avoid potential bugs. Code in `main` is now fully documented and annotated with type hints.
- Simplified project and test setup by @KarelZe in #38. This greatly improves reproducibility and ease of development.
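The parametrized entry point can be sketched with `argparse`. Flag names follow the example invocation above; the defaults and model choices here are illustrative assumptions, not the actual `train_model.py`:

```python
import argparse


def parse_args(argv=None):
    """Sketch of a parametrized training entry point. Flags mirror the
    example call; defaults and choices are illustrative."""
    parser = argparse.ArgumentParser(description="Run a hyperparameter study.")
    parser.add_argument("--trials", type=int, default=5,
                        help="number of Bayesian-search trials")
    parser.add_argument("--seed", type=int, default=42,
                        help="seed for reproducible studies")
    parser.add_argument("--model", default="gbm",
                        choices=["gbm", "tabtransformer", "classical"])
    parser.add_argument("--dataset",
                        default="fbv/thesis/train_val_test_w_trade_size:v0",
                        help="versioned artefact to load")
    parser.add_argument("--features", default="ml",
                        help="name of the feature set")
    return parser.parse_args(argv)
```

Because every run is fully described by its flags plus a versioned data artefact, interrupted studies can be resumed by re-invoking the script with the same arguments.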
Writing
- Add notes to zettelkasten by @KarelZe in #29
- Add proposal for feature sets by @KarelZe in #31
- Simplified and extended `readme.md` by @KarelZe in #36
- Finalized exposé with numbers by @KarelZe in #37
Outlook
- Continue with exploratory data analysis and start with explanatory data analysis
- Analyze low resource utilization and slow training of TabTransformer
- Read more again and minimize the stack of open papers (30+)
- Better connect existing ideas in zettelkasten
- Improve test coverage
Full Changelog: v0.2.1...v0.2.2
Changes between November 7th and November 13th
What's Changed
Empirical Study
- Added all classical rules from the @CaroGrau paper. Any rule can now be stacked together in an arbitrary order, e.g., `predict_rules(layers=[(trade_size,"ex"), (quote,"best"), (quote, "ex")], name="Tradesize + Quote (NBBO) + Quote (ISE)")`. Minor differences in accuracy still exist due to a different handling of missing values. By @KarelZe in #27 (WIP)
- Started with an exploratory data analysis by @KarelZe in #25 (WIP)
- Added proposal for feature set definition by @KarelZe in #29 (WIP)
- Created a Docker image to run code on bwUniCluster 2.0 and runpod by @KarelZe in #28 (WIP)
- Did all the setup to connect to bwUniCluster 2.0
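Stacking classical rules follows a simple fallback pattern: each rule either classifies the trade or defers to the next layer. A hypothetical sketch — the rule logic, field names, and function names are illustrative, not the thesis implementation behind `predict_rules`:

```python
def predict_stacked(rules, trade):
    """Apply trade-classification rules in order; the first rule that
    yields a decision wins, later layers only handle trades the earlier
    ones could not classify. Each rule returns 1 (buy), -1 (sell),
    or None (defer)."""
    for rule in rules:
        label = rule(trade)
        if label is not None:
            return label
    return None  # no rule in the stack could classify the trade


# Illustrative example rules (simplified, not the thesis versions):
def quote_rule(trade):
    # Classify relative to the quote midpoint; defer on midpoint trades.
    mid = (trade["bid"] + trade["ask"]) / 2
    if trade["price"] > mid:
        return 1
    if trade["price"] < mid:
        return -1
    return None


def tick_rule(trade):
    # Classify by comparison with the previous trade price.
    if trade["price"] > trade["prev_price"]:
        return 1
    if trade["price"] < trade["prev_price"]:
        return -1
    return None
```

With this structure, any ordering like "Tradesize + Quote (NBBO) + Quote (ISE)" is just a different list of layers, which is what makes the Bayesian search over rule combinations possible.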
Writing
Outlook for Upcoming Week
- implement a transformer-based baseline
- rework hyperparameter searches to be interruptible
- continue with exploratory data analysis and start with explanatory data analysis
- add more notes to zettelkasten
- better connect existing ideas in zettelkasten
- finish WIPs
Full Changelog: v0.2.0...v0.2.1
Changes between October 31st and November 6th
What's Changed
Empirical Study
- Improved adversarial validation and memory-constrained CSV loading by @KarelZe in #17, #18 and #24
- Implementation of (promising) GBM baseline, Bayesian search, some classical rules and robustness checks by @KarelZe in #19
- Experimented with training models on `runpods.io` to mitigate severe performance issues with `Google Colab`
Writing
- Restructured readme by @KarelZe in #21 and #22
- Added 15+ literature notes to zettelkasten by @KarelZe in #23 and #20
- Researched additional 30+ papers to read in the next week
Outlook for Upcoming Week
- start with explanatory data analysis
- investigate differences in the accuracy of classical rules with regard to @CaroGrau paper
- start implementing a transformer-based baseline
- add more notes to zettelkasten
- bundle training scripts in docker container
Full Changelog: v0.1.9...v0.2.0
Changes between October 27th and October 30th
What's Changed
Empirical Study
- Set up Google Cloud Storage and Google Colab.
- Loaded CSV data into a pandas data frame, inferred dtypes, performed optimizations, and exported into `.parquet` chunks by @KarelZe in #12.
- Added data set versioning using Weights & Biases.
- Created sub-samples, e.g., 2015, and train, validation, and test sets.
- Cleaned up requirements / fixed versions in `requirements.txt`.
- Created tests / assertions against the @CaroGrau paper in `2.0-mb-data_preprocessing_loading_splitting.ipynb`.
- Ran adversarial validation in `2.0-mb-data_preprocessing_loading_splitting.ipynb`.
Writing
- Added more notes to Zettelkasten, e.g., 6f704ff.
Other Changes
Full Changelog: v0.1.6...v0.1.9