Changes between November 28th and December 4th
What's Changed
Empirical Study ⚗️
- Exploratory data analysis 🍁 by @KarelZe in #40, #50, and #51. Refactored the EDA to the training set only, included new features, and added new visualizations. Performed cross-validation to study the results of a gradient-boosted model on different feature sets (see the first sketch after this list). Overall, the results look promising, as the untuned gradient-boosted tree outperforms the respective classical counterparts on all three subsets. Performance is relatively stable across folds and between the validation and test sets. Log transform and imputation make no difference for gradient-boosted trees, as expected. Problems like the high cardinality of some categorical features still need to be addressed. I also feel like more features or other transformations could help and will do more research in the classical literature on this.
- Add model tracking and saving of `optuna.Study` to `wandb` through a callback 💫 by @KarelZe in #55 and #63. All major data (data sets, models, studies, and some diagrams) is now tracked in `wandb` and saved to `gcs`. The use of callbacks makes the implementation of other parts, like learning rate schedulers or stochastic weight averaging, much easier; a sketch of such a callback follows this list.
- I experimented with further accelerating the `TabTransformer` through `PyTorch 2.0` and `datapipes` in #64 (WIP) by @KarelZe. This is still in progress, as I wasn't able to compile the model yet. I got a vague idea of possible solutions, e.g., a stripped-down implementation or an upgrade of CUDA. Waiting could also help, as `PyTorch 2.0` is in early beta and was only announced last weekend. I didn't look into `datapipes` yet, which could help close the gap in feeding the GPU enough data. I also did some research on how high-cardinality features like `ROOT` can be handled in the model to avoid an explosion in parameters; the latter is necessary to train the model with reasonable performance. (See the compile and hashed-embedding sketches after this list.)
- Increased test coverage to 83 % or 45 tests in total. Writing these tests helped me discover some minor bugs, e.g., in the depth rule, that I had previously overlooked. Tests were added for:
  - Add new features to train, validation, and test set 🏖️ by @KarelZe in #51
  - Refactor redundant code in `ClassicalClassifier` 🧫 by @KarelZe in #52 and #53
  - Remove RDB support 🦣 by @KarelZe in #54
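A minimal sketch of the cross-validation comparison mentioned above, assuming scikit-learn; the feature-set names, column names, file path, and label column are illustrative, not the actual ones from the notebooks:

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical feature subsets; the real sets live in the EDA notebook.
FEATURE_SETS = {
    "classical": ["TRADE_PRICE", "bid", "ask"],
    "classical_size": ["TRADE_PRICE", "bid", "ask", "TRADE_SIZE"],
    "all": ["TRADE_PRICE", "bid", "ask", "TRADE_SIZE", "ROOT"],
}

df = pd.read_parquet("train_set.parquet")  # hypothetical path
y = df.pop("buy_sell")                     # hypothetical label column

# Ordinal-encode the high-cardinality option root so the tree can split on it.
df["ROOT"] = df["ROOT"].astype("category").cat.codes

for name, cols in FEATURE_SETS.items():
    # Untuned model; it handles missing values natively, so no imputation
    # or log transform is needed, consistent with the findings above.
    clf = HistGradientBoostingClassifier(random_state=42)
    scores = cross_val_score(clf, df[cols], y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```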
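The study-saving callback from #55 and #63 works roughly like the following sketch; the project name, artifact name, and toy objective are my assumptions, not the actual repository code:

```python
import joblib
import optuna
import wandb


def save_study(study: optuna.Study, trial: optuna.trial.FrozenTrial) -> None:
    """Optuna callback: checkpoint the study to wandb after every trial."""
    joblib.dump(study, "study.pkl")
    artifact = wandb.Artifact("optuna-study", type="study")  # hypothetical names
    artifact.add_file("study.pkl")
    wandb.log_artifact(artifact)
    wandb.log({"best_value": study.best_value, "trial": trial.number})


def objective(trial: optuna.Trial) -> float:
    # Toy objective; the real one trains and scores a model.
    x = trial.suggest_float("x", -10.0, 10.0)
    return (x - 2.0) ** 2


wandb.init(project="thesis")  # hypothetical project name
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20, callbacks=[save_study])
```

Because the callback runs after each trial, extras such as learning rate schedulers or stochastic weight averaging can be wired in the same way, without touching the training loop itself.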
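For the `PyTorch 2.0` experiment, the compile step looks roughly like this; the model here is a small stand-in, since the actual `TabTransformer` from #64 is what currently fails to compile:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in module; the real TabTransformer is much larger.
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 2),
).to(device)

# torch.compile (new in PyTorch 2.0) traces the model and emits fused
# kernels; the first forward pass triggers the actual compilation.
compiled_model = torch.compile(model)

x = torch.randn(32, 64, device=device)
logits = compiled_model(x)
```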
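One direction from the research on high-cardinality features: hash the raw `ROOT` symbol into a fixed number of buckets before embedding, so the parameter count is bounded by the bucket count rather than the vocabulary size, at the cost of occasional collisions between unrelated roots. This is my illustration of the idea, not code from the repository; the bucket count and dimensions are made up:

```python
import zlib

import torch
import torch.nn as nn

NUM_BUCKETS = 4096  # assumption: far fewer buckets than distinct roots
EMBED_DIM = 32


class HashedEmbedding(nn.Module):
    """Embeds arbitrary string categories via the hashing trick."""

    def __init__(self, num_buckets: int, dim: int) -> None:
        super().__init__()
        self.num_buckets = num_buckets
        self.embedding = nn.Embedding(num_buckets, dim)

    def forward(self, roots: list[str]) -> torch.Tensor:
        # zlib.crc32 is stable across runs, unlike Python's salted hash().
        idx = torch.tensor(
            [zlib.crc32(r.encode()) % self.num_buckets for r in roots]
        )
        return self.embedding(idx)


emb = HashedEmbedding(NUM_BUCKETS, EMBED_DIM)
vectors = emb(["AAPL", "SPX", "TSLA"])  # hypothetical option roots
print(vectors.shape)  # torch.Size([3, 32])
```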
Writing
- Added new notes and connected existing ideas. I was able to reduce the pile of unread papers to 30. #65 (WIP) by @KarelZe.
Other Changes
- Add `dependabot.yml` for Dependabot 🦾 by @KarelZe in #56 (example below)
- Bump schneegans/dynamic-badges-action from 1.4.0 to 1.6.0 by @dependabot in #57
- Bump typer from 0.6.1 to 0.7.0 by @dependabot in #62
- Bump fastparquet from 0.8.3 to 2022.11.0 by @dependabot in #60
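For reference, a minimal `dependabot.yml` looks like the following; the actual config from #56 may enable more ecosystems or different schedules:

```yaml
version: 2
updates:
  - package-ecosystem: "pip"            # Python dependencies
    directory: "/"
    schedule:
      interval: "weekly"
  - package-ecosystem: "github-actions" # keeps actions like dynamic-badges current
    directory: "/"
    schedule:
      interval: "weekly"
```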
Outlook
- Pre-write first chapter of thesis
- Set up plan for writing
- Read 10 papers and add to zettelkasten
- Turn the EDA notebook into scripts for feature generation and document them with tests
- Train the TabTransformer and the gradient-boosted model before the meeting with @CaroGrau
- Further improve the training performance of the TabTransformer. Try out `datapipes` and adjust the implementation to be closer to the paper. Decrease cardinality through NLP techniques
New Contributors
- @dependabot made their first contribution in #57
Full Changelog: v0.2.3...v0.2.4