
Changes between November 28th and December 4th

@KarelZe released this on 04 Dec 16:45

What's Changed

Empirical Study ⚗️

  • Exploratory data analysis 🍁 by @KarelZe in #40, #50, and #51. Refactored the EDA to the training set only, included new features, and added new visualizations. Performed cross-validation to study the results of a gradient-boosted model on different feature sets (see the first sketch after this list). Overall, the results look promising: the untuned gradient-boosted tree outperforms the respective classical counterparts on all three subsets. Performance is relatively stable across folds and between the validation and test sets. Log transform and imputation make no difference for gradient-boosted trees, as expected. Problems like the high cardinality of some categorical features still need to be addressed. I also feel that more features or other transformations could help and will do more research in the classical literature on this.
  • Add model tracking and add saving of optuna.Study to wandb through a callback 💫 by @KarelZe in #55 and #63. All major data (data sets, models, studies, and some diagrams) is now tracked in wandb and saved to GCS. The use of callbacks makes the implementation of other parts, like learning rate schedulers or stochastic weight averaging, much easier. A sketch of the callback idea follows after this list.
  • I experimented with further accelerating the TabTransformer through PyTorch 2.0 and datapipes in #64 (WIP) by @KarelZe. This is still in progress, as I wasn't able to compile the model yet. I have a vague idea of possible solutions, e.g., a stripped-down implementation or an upgrade of CUDA. Waiting could also help, as PyTorch 2.0 is in early beta and was only announced over the weekend. I didn't look into datapipes yet, which could help close the gap in serving the GPU enough data. I also did some research on how high-cardinality features like ROOT can be handled in the model to avoid an explosion in parameters; the latter is necessary to train the model with reasonable performance. A minimal torch.compile sketch follows after this list.
  • Increased test coverage to 83 % (45 tests in total). Writing these tests helped me discover some minor bugs, e.g., in the depth rule, that I had previously overlooked. A sketch of the parameterized-test style follows after this list. Tests were added for:
    • Add tests for logic of ClassicalClassifier 🚑 by @KarelZe in #45
    • Add tests for TabDataSet ⛑️ by @KarelZe in #46
    • Add tests for TabDataLoader by @KarelZe in #47
    • Add mixin and new tests for neural nets 🫗 by @KarelZe in #48
    • Add parameterized tests and migrate to PyTest 🍭 by @KarelZe in #49
  • Add new features to train, validation and test set 🏖️ by @KarelZe in #51
  • Refactor redundant code ClassicalClassifier 🧫 by @KarelZe in #52 and #53.
  • Remove RDB support 🦣 by @KarelZe in #54
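
For illustration, here is a minimal sketch of the cross-validation setup over feature sets described above. The feature subsets, column names, and synthetic data are placeholders, not the project's actual features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the real study uses the project's training set.
rng = np.random.default_rng(42)
X_train = pd.DataFrame(
    rng.normal(size=(1000, 4)), columns=["bid", "ask", "price", "size"]
)
y_train = rng.integers(0, 2, size=1000)

# Hypothetical feature subsets, mirroring the idea of classical vs. extended sets.
feature_sets = {
    "classical": ["bid", "ask", "price"],
    "extended": ["bid", "ask", "price", "size"],
}

for name, cols in feature_sets.items():
    clf = HistGradientBoostingClassifier()  # untuned gradient-boosted baseline
    scores = cross_val_score(clf, X_train[cols], y_train, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```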
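The callback idea from #55 and #63 can be sketched roughly as follows. This is my own minimal illustration under assumed names (`wandb_callback`, the project name), not the project's actual implementation; it logs each finished trial to wandb and persists the optuna.Study alongside the run:

```python
import joblib
import optuna
import wandb

run = wandb.init(project="thesis", job_type="tuning")  # project name is a placeholder

def wandb_callback(study: optuna.Study, trial: optuna.trial.FrozenTrial) -> None:
    # Log the trial result and its parameters to wandb after every trial.
    wandb.log({"trial": trial.number, "value": trial.value, **trial.params})
    # Serialize the study and attach it to the run, so it can be restored later.
    joblib.dump(study, "study.pkl")
    wandb.save("study.pkl")

def objective(trial: optuna.Trial) -> float:
    # Toy objective; the real one trains and evaluates a model.
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20, callbacks=[wandb_callback])
```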
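For reference, a minimal sketch of what #64 attempts with PyTorch 2.0. A plain `nn.Sequential` stands in for the TabTransformer to keep the example self-contained:

```python
import torch
import torch.nn as nn

# Stand-in for the TabTransformer; the real model is what currently fails to compile.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))

compiled = torch.compile(model)  # requires PyTorch >= 2.0

x = torch.randn(8, 16)
y = compiled(x)  # first call triggers compilation; later calls reuse the compiled graph
```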
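And a hedged sketch of the parameterized pytest style from #49. The `tick_rule` helper and the test values are made up for illustration; the real tests target ClassicalClassifier, TabDataSet, and the other components listed above:

```python
import pytest

def tick_rule(price: float, prev_price: float) -> int:
    """Classify a trade as buy (1) or sell (-1) from the price change."""
    return 1 if price > prev_price else -1

@pytest.mark.parametrize(
    "price, prev_price, expected",
    [
        (101.0, 100.0, 1),   # uptick -> buy
        (99.0, 100.0, -1),   # downtick -> sell
    ],
)
def test_tick_rule(price, prev_price, expected):
    assert tick_rule(price, prev_price) == expected
```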

Writing

  • Added new notes and connected existing ideas. I was able to reduce the pile of unread papers to 30. #65 (WIP) by @KarelZe.


Outlook:

  • Pre-write first chapter of thesis
  • Set up plan for writing
  • Read 10 papers and add to zettelkasten
  • Turn the EDA notebook into scripts for feature generation and document them with tests
  • Train the TabTransformer and the gradient-boosted model until the meeting with @CaroGrau
  • Further improve the training performance of the TabTransformer. Try out datapipes and adjust the implementation to be closer to the paper. Decrease cardinality through NLP techniques

Full Changelog: v0.2.3...v0.2.4