diff --git a/README.md b/README.md
index 4eedac8..1fe7ca9 100644
--- a/README.md
+++ b/README.md
@@ -1,55 +1,14 @@
 # `autofeat` library
-### Linear Prediction Models with Automated Feature Engineering and Selection
 
-This library contains the `AutoFeatRegressor` and `AutoFeatClassifier` models with a similar interface as `scikit-learn` models:
-- `fit()` function to fit the model parameters
-- `predict()` function to predict the target variable given the input
-- `predict_proba()` function to predict probabilities of the target variable given the input (classifier only)
-- `score()` function to calculate the goodness of the fit (R^2/accuracy)
-- `fit_transform()` and `transform()` functions, which extend the given data by the additional features that were engineered and selected by the model
+This library contains `sklearn`-compatible linear prediction models with automated feature engineering and selection capabilities.
 
-When calling the `fit()` function, internally the `fit_transform()` function will be called, so if you're planing to call `transform()` on the same data anyways, just call `fit_transform()` right away. `transform()` is mostly useful if you've split your data into training and test data and did not call `fit_transform()` on your whole dataset. The `predict()` and `score()` functions can either be given data in the format of the original dataframe that was used when calling `fit()`/`fit_transform()` or they can be given an already transformed dataframe.
-
-In addition, only the feature selection part is also available in the `FeatureSelector` model.
-
-Furthermore (as of version 2.0.0), minimal feature selection (removing zero variance and redundant features), engineering (simple product and ratio of features), and scaling (power transform to make features more normally distributed) is also available in the `AutoFeatLight` model.
-
-The `AutoFeatRegressor`, `AutoFeatClassifier`, and `FeatureSelector` models need to be **fit on data without NaNs**, as they internally call the sklearn `LassoLarsCV` model, which can not handle NaNs. When calling `transform()`, NaNs (but not `np.inf`) are okay.
-
-The [autofeat examples notebook](https://github.com/cod3licious/autofeat/blob/main/notebooks/autofeat_examples.ipynb) contains a simple usage example - try it out! :) Additional examples can be found in the autofeat benchmark notebooks for [regression](https://github.com/cod3licious/autofeat/blob/main/notebooks/autofeat_benchmark_regression.ipynb) (which also contains the code to reproduce the results from the paper mentioned below) and [classification](https://github.com/cod3licious/autofeat/blob/main/notebooks/autofeat_benchmark_classification.ipynb), as well as the testing scripts.
-
-Please keep in mind that since the `AutoFeatRegressor` and `AutoFeatClassifier` models can generate very complex features, they might **overfit on noise** in the dataset, i.e., find some new features that lead to good prediction on the training set but result in a poor performance on new test samples. While this usually only happens for datasets with very few samples, we suggest you carefully inspect the features found by `autofeat` and use those that make sense to you to train your own models.
-
-Depending on the number of `feateng_steps` (default 2) and the number of input features, `autofeat` can generate a very huge feature matrix (before selecting the most appropriate features from this large feature pool). By specifying in `feateng_cols` those columns that you expect to be most valuable in the feature engineering part, the number of features can be greatly reduced. Additionally, `transformations` can be limited to only those feature transformations that make sense for your data. Last but not least, you can subsample the data used for training the model to limit the memory requirements. After the model was fit, you can call `transform()` on your whole dataset to generate only those few features that were selected during `fit()`/`fit_transform()`.
-
-
-### Installation
-You can either download the code from here and include the autofeat folder in your `$PYTHONPATH` or install (the library components only) via pip:
-
-    $ pip install autofeat
-
-The library requires Python 3! Other dependencies: `numpy`, `pandas`, `scikit-learn`, `sympy`, `joblib`, `pint` and `numba`.
-
-
-### Paper
-For further details on the model and implementation please refer to the [paper](https://arxiv.org/abs/1901.07329) - and of course if any of this code was helpful for your research, please consider citing it:
-```
-@inproceedings{horn2019autofeat,
-  title={The autofeat Python Library for Automated Feature Engineering and Selection},
-  author={Horn, Franziska and Pack, Robert and Rieger, Michael},
-  booktitle={Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
-  pages={111--120},
-  year={2019},
-  organization={Springer}
-}
-```
-
-If you don't like reading, you can also watch a video of my [talk at the PyData conference](https://www.youtube.com/watch?v=4-4pKPv9lJ4) about automated feature engineering and selection with `autofeat`.
+For more information please have a look at the [documentation]https://franziskahorn.de/autofeat.
 
 The code is intended for research purposes.
 
 If you have any questions please don't hesitate to send me an [email](mailto:cod3licious@gmail.com) and of course if you should find any bugs or want to contribute other improvements, pull requests are very welcome!
 
+
 ### Acknowledgments
 
 This project was made possible thanks to support by [BASF](https://www.basf.com).
diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md
new file mode 100644
index 0000000..75d1df9
--- /dev/null
+++ b/docs/CHANGELOG.md
@@ -0,0 +1,145 @@
+# Changelog
+
+
+### 2.1.3 (2024-07-04)
+
+- minor style and type fixes
+- improved package structure
+- added changelog & docs
+
+
+### 2.1.2 (2023-07-28)
+
+- converted most print statements to logging outputs
+- moved tests and make pytest compatible
+- more qa and style fixes (using ruff)
+
+
+### 2.1.1 (2023-06-25)
+
+- fixed annotations for backwards compatibility
+
+
+### 2.1.0 (2023-05-14)
+
+- added `predict_proba` functionality for classifier (by @mglowacki100)
+- formatting fixes (using black)
+- added type hints
+
+
+### 2.0.10 (2021-10-28)
+
+- fixed issue #29 (by @stephanos-stephani)
+
+
+### 2.0.9 (2021-06-12)
+
+- speed up correlation computation; fixes issue #28
+
+
+### 2.0.8 (2021-06-03)
+
+- use numba jit for feature generation (by @jeethu)
+
+
+### 2.0.7 (2021-06-02)
+
+- use numba for standardization (by @jeethu)
+
+
+### 2.0.5 (2021-01-16)
+
+- fixed TypeError while running tests with scikit-learn 0.24.0 (by @jeethu)
+- minor efficiency improvements in apply_transformations (by @jeethu)
+- use numba to accelerate feateng (by @jeethu)
+
+
+### 2.0.4 (2020-11-30)
+
+- update sympy call to work with new version
+
+
+### 2.0.3 (2020-11-11)
+
+- turn scaling off by default
+- remove more correlated cols by starting with the features that has the most correlated columns
+
+
+### 2.0.2 (2020-11-11)
+
+- fixed typo
+
+
+### 2.0.1 (2020-11-11)
+
+- use correlation threshold in autofeat light as parameter
+
+
+### 2.0.0 (2020-11-07)
+
+- added `AutoFeatLight` model for simple feature selection (removing zero variance and redundant features), engineering (product and ratio of original features) and power transform to make features more normally distributed
+
+
+### 1.1.3 (2020-07-21)
+
+- categorical columns can contain strings now
+
+
+### 1.1.2 (2020-02-28)
+
+- don't generate addition/subtr features at the highest level, i.e., if they would just be removed anyways
+
+
+### 1.1.1 (2020-02-25)
+
+- use LassoLarsCV instead of RidgeCV as final regression model
+- minor tweaks to feature selection to avoid longer formulas
+
+
+### 1.1.0 (2020-02-24)
+
+- include categorical columns for feateng by default
+- add correlation filtering back into feat selection
+
+
+### 1.0.0 (2020-02-24)
+
+- changed autofeat model to differentiate between regression and classification tasks, adding the `AutoFeatRegressor` and `AutoFeatClassifier` classes
+- simplified feature selection process
+
+
+### 0.2.5 (2019-05-12)
+
+- more robust featsel with noise filtering
+
+
+### 0.2.2 (2019-05-09)
+
+- change default value for `feateng_steps` to 2, in line with results on realworld datasets
+
+
+### 0.2.1 (2019-05-09)
+
+- make feature selection less prone to overfitting
+
+
+### 0.2.0 (2019-05-02)
+
+- add `FeatureSelector` class to use feature selection separately
+- make feature selection more robust and move into featsel
+- make the models more sklearn like and test with sklearn estimator tests
+- replace sympy's ufuncify with lambdify
+- better logs
+- use immutable default arguments
+- make pi theorem optional
+- handle nans in transform
+
+
+### 0.1.1 (2019-01-23)
+
+- updated documentation
+
+
+### 0.1.0 (2019-01-22)
+
+- initial release with regression model
diff --git a/docs/index.md b/docs/index.md
index 000ea34..3cae6e8 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,17 +1,32 @@
-# Welcome to MkDocs
+# Getting Started
 
-For full documentation visit [mkdocs.org](https://www.mkdocs.org).
+This library contains the `AutoFeatRegressor` and `AutoFeatClassifier` models with a similar interface as `scikit-learn` models:
 
-## Commands
+- `fit()` function to fit the model parameters
+- `predict()` function to predict the target variable given the input
+- `predict_proba()` function to predict probabilities of the target variable given the input (classifier only)
+- `score()` function to calculate the goodness of the fit (R^2/accuracy)
+- `fit_transform()` and `transform()` functions, which extend the given data by the additional features that were engineered and selected by the model
 
-* `mkdocs new [dir-name]` - Create a new project.
-* `mkdocs serve` - Start the live-reloading docs server.
-* `mkdocs build` - Build the documentation site.
-* `mkdocs -h` - Print help message and exit.
+When calling the `fit()` function, internally the `fit_transform()` function will be called, so if you're planing to call `transform()` on the same data anyways, just call `fit_transform()` right away. `transform()` is mostly useful if you've split your data into training and test data and did not call `fit_transform()` on your whole dataset. The `predict()` and `score()` functions can either be given data in the format of the original dataframe that was used when calling `fit()`/`fit_transform()` or they can be given an already transformed dataframe.
 
-## Project layout
+In addition, only the feature selection part is also available in the `FeatureSelector` model.
+
+Furthermore (as of version 2.0.0), minimal feature selection (removing zero variance and redundant features), engineering (simple product and ratio of features), and scaling (power transform to make features more normally distributed) is also available in the `AutoFeatLight` model.
+
+The `AutoFeatRegressor`, `AutoFeatClassifier`, and `FeatureSelector` models need to be **fit on data without NaNs**, as they internally call the sklearn `LassoLarsCV` model, which can not handle NaNs. When calling `transform()`, NaNs (but not `np.inf`) are okay.
+
+The [autofeat examples notebook](https://github.com/cod3licious/autofeat/blob/main/notebooks/autofeat_examples.ipynb) contains a simple usage example - try it out! :) Additional examples can be found in the autofeat benchmark notebooks for [regression](https://github.com/cod3licious/autofeat/blob/main/notebooks/autofeat_benchmark_regression.ipynb) (which also contains the code to reproduce the results from the paper mentioned below) and [classification](https://github.com/cod3licious/autofeat/blob/main/notebooks/autofeat_benchmark_classification.ipynb), as well as the testing scripts.
+
+Please keep in mind that since the `AutoFeatRegressor` and `AutoFeatClassifier` models can generate very complex features, they might **overfit on noise** in the dataset, i.e., find some new features that lead to good prediction on the training set but result in a poor performance on new test samples. While this usually only happens for datasets with very few samples, we suggest you carefully inspect the features found by `autofeat` and use those that make sense to you to train your own models.
+
+Depending on the number of `feateng_steps` (default 2) and the number of input features, `autofeat` can generate a very huge feature matrix (before selecting the most appropriate features from this large feature pool). By specifying in `feateng_cols` those columns that you expect to be most valuable in the feature engineering part, the number of features can be greatly reduced. Additionally, `transformations` can be limited to only those feature transformations that make sense for your data. Last but not least, you can subsample the data used for training the model to limit the memory requirements. After the model was fit, you can call `transform()` on your whole dataset to generate only those few features that were selected during `fit()`/`fit_transform()`.
+
+
+### Installation
+You can either download the code from here and include the autofeat folder in your `$PYTHONPATH` or install (the library components only) via pip:
+
+    $ pip install autofeat
+
+The library requires Python 3! Other dependencies: `numpy`, `pandas`, `scikit-learn`, `sympy`, `joblib`, `pint` and `numba`.
 
-    mkdocs.yml    # The configuration file.
-    docs/
-        index.md  # The documentation homepage.
-        ...       # Other markdown pages, images and other files.
diff --git a/docs/science.md b/docs/science.md
new file mode 100644
index 0000000..6051295
--- /dev/null
+++ b/docs/science.md
@@ -0,0 +1,17 @@
+# The Science
+
+### Paper
+For further details on the model and implementation please refer to the [paper](https://arxiv.org/abs/1901.07329) - and of course if any of this code was helpful for your research, please consider citing it:
+```
+@inproceedings{horn2019autofeat,
+  title={The autofeat Python Library for Automated Feature Engineering and Selection},
+  author={Horn, Franziska and Pack, Robert and Rieger, Michael},
+  booktitle={Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
+  pages={111--120},
+  year={2019},
+  organization={Springer}
+}
+```
+
+### PyData Talk
+If you don't like reading, you can also watch a video of my [talk at the PyData conference](https://www.youtube.com/watch?v=4-4pKPv9lJ4) about automated feature engineering and selection with `autofeat`.
diff --git a/mkdocs.yml b/mkdocs.yml
index 3b08b4c..c1d2f12 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -1,4 +1,15 @@
 site_name: autofeat docs
 site_url: https://franziskahorn.de/autofeat
+repo_url: https://github.com/cod3licious/autofeat
+repo_name: autofeat@GitHub
 theme:
   name: material
+  palette:
+    primary: teal
+  icon:
+    repo: fontawesome/brands/github
+    logo: fontawesome/solid/shield-cat
+nav:
+  - Getting Started: index.md
+  - The Science: science.md
+  - Changelog: CHANGELOG.md
diff --git a/pyproject.toml b/pyproject.toml
new file mode 100644
index 0000000..8262519
--- /dev/null
+++ b/pyproject.toml
@@ -0,0 +1,140 @@
+[build-system]
+requires = ["poetry-core"]
+build-backend = "poetry.core.masonry.api"
+
+[tool.poetry]
+name = "autofeat"
+version = "2.1.3"
+description = "Automatic Feature Engineering and Selection Linear Prediction Model"
+authors = ["Franziska Horn "]
+readme = "README.md"
+packages = [{include = "autofeat", from = "src"}]
+license = "MIT"
+keywords = ["automl", "feature engineering", "feature selection", "linear model"]
+repository = "https://github.com/cod3licious/autofeat"
+homepage = "https://franziskahorn.de/autofeat"
+
+[tool.poetry.dependencies]
+python = "^3.8.1,<3.13"
+numpy = "^1.20.3"
+numba = ">=0.53.1"
+joblib = "^1.2.0"
+pandas = ">=1.3.5,<3.0.0"
+pint = ">=0.17,<1.0"
+scipy = "^1.7.3"
+scikit-learn = "^1.2.0"
+sympy = "^1.7.1"
+
+[tool.poetry.group.dev.dependencies]
+bandit = "^1.7.7"
+ipython = ">=8.0.0"
+notebook = "^6.5.0"
+matplotlib = "^3.7.2"
+mkdocs-material = "^9.5.28"
+mypy = "^1.7.1"
+poethepoet = ">=0.24.4"
+pytest = "^7.4.0"
+pyupgrade = "^3.9.0"
+ruff = ">=0.2.1"
+
+
+[tool.poe.tasks]
+# run with `poetry run poe format`
+format = "bash -c 'pyupgrade --py38-plus $(find **/*.py) && ruff check --fix . && ruff format .'"
+check = "bash -c 'ruff check . && mypy src/autofeat && bandit -c pyproject.toml -r .'"
+test = "bash -c 'pytest tests'"
+
+
+[tool.ruff]
+target-version = "py38"
+line-length = 128
+
+# Exclude a variety of commonly ignored directories.
+exclude = [
+    ".eggs",
+    ".git",
+    ".ipynb_checkpoints",
+    ".mypy_cache",
+    ".pytest_cache",
+    ".pytype",
+    ".ruff_cache",
+    ".venv",
+    "__pypackages__",
+    "__pycache__",
+    "build",
+    "dist",
+]
+
+[tool.ruff.lint]
+select = ["A", "B", "C4", "D", "E", "F", "G", "I", "N", "Q", "W", "COM", "DTZ", "FA", "ICN", "INP", "PIE", "PD", "PL", "RSE", "RET", "RUF", "SIM", "SLF", "UP"]
+
+# Allow autofix for all enabled rules (when `--fix`) is provided.
+fixable = ["C4", "D", "E", "G", "I", "Q", "W", "COM", "PD", "RSE", "RET", "RUF", "SIM", "SLF", "UP"]
+# Avoid trying to fix flake8-bugbear (`B`) violations.
+unfixable = ["B", "F841"]
+
+# Ignore a few rules that we consider too strict.
+ignore = ["E501",  # Line too long
+    "E741",  # Ambiguous variable name: `l`
+    "PD901",  # 'df' is a bad variable name
+    "N999",  # Invalid module name: '🏠_Home'
+    "N802", "N803", "N806",  # names should be lowercase
+    "D1",  # D100 - D107: Missing docstrings
+    "D212",  # Multi-line docstring summary should start at the second line
+    "D400",  # adds a period at the end of line (problematic when it is a path)
+    "D415",  # First line should end with a period, question mark, or exclamation point
+    "D203", "D204", "D205",  # required blank lines
+    "G004",  # Logging statement uses f-string
+    "PIE790",  # Unnecessary `pass` statement
+    "PLR2004",  # Magic value used in comparison, consider replacing 0.999 with a constant variable
+    "PLR09",  # Too many arguments to function call
+    "COM812",  # trailing comma - don't use together with formatter
+]
+
+# Ignore `E402` (import violations) in all `__init__.py` files, and in `path/to/file.py`.
+[tool.ruff.lint.per-file-ignores]
+"__init__.py" = ["E402"]
+
+[tool.ruff.lint.pydocstyle]
+convention = "google"
+
+[tool.ruff.lint.isort]
+known-first-party = ["autofeat", "autofeat.*"]
+section-order = ["future", "standard-library", "third-party", "first-party", "tests", "local-folder"]
+
+[tool.ruff.lint.isort.sections]
+"tests" = ["tests"]
+
+[tool.ruff.lint.flake8-import-conventions]
+
+[tool.mypy]
+plugins = ["numpy.typing.mypy_plugin"]
+
+[[tool.mypy.overrides]]
+module = [
+    "pandas.*",
+    "sklearn.*",
+    "joblib.*",
+    "scipy.*",
+    "numpy",
+    "numba",
+    "pandas.*",
+    "streamlit.*",
+    "matplotlib.*",
+    "IPython.*",
+    "plotly.*",
+    "seaborn.*",
+    "requests.*",
+    "sqlalchemy.*"
+]
+ignore_missing_imports = true
+
+[tool.bandit]
+targets = ["src/autofeat/"]
+recursive = true
+skips = ["B101"]
+
+[tool.pytest.ini_options]
+minversion = "6.0"
+addopts = "--disable-warnings"
+markers = ["slow"]
diff --git a/src/autofeat/__init__.py b/src/autofeat/__init__.py
index 52469f1..c595e38 100644
--- a/src/autofeat/__init__.py
+++ b/src/autofeat/__init__.py
@@ -2,7 +2,7 @@
 # License: MIT
 
 name = "autofeat"
-__version__ = "2.1.2"
+__version__ = "2.1.3"
 from .autofeatlight import AutoFeatLight  # noqa
 from .autofeat import AutoFeatModel, AutoFeatRegressor, AutoFeatClassifier  # noqa
 from .featsel import FeatureSelector  # noqa