From 6fd22d7fd9134574b60f8722f68ae3dbc68c3c5c Mon Sep 17 00:00:00 2001
From: Villu Ruusmann
Date: Fri, 29 Mar 2024 21:12:50 +0200
Subject: [PATCH] Updated documentation

---
 NEWS.md   | 118 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 README.md |   6 +--
 2 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/NEWS.md b/NEWS.md
index 50d629f..0fca8e8 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,3 +1,121 @@
+# 0.105.1 #
+
+## Breaking changes
+
+None
+
+## New features
+
+* Added support for [`sklearn.preprocessing.TargetEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html) class.
+
+* Added support for [`sklearn.preprocessing.SplineTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.SplineTransformer.html) class.
+
+The `SplineTransformer` class computes a B-spline for a feature, which is then used to expand the feature into new features that correspond to B-spline basis elements.
+
+This class is not suitable for simple feature and prediction scaling purposes (e.g. calibration of computed probabilities).
+Consider using the `sklearn2pmml.preprocessing.BSplineTransformer` class in such a situation.
+
+* Added support for [`statsmodels.api.QuantReg`](https://www.statsmodels.org/dev/generated/statsmodels.regression.quantile_regression.QuantReg.html) class.
+
+* Added `input_float` conversion option.
+
+Scikit-Learn tree and tree ensemble models prepare their inputs by first casting them to `(numpy.)float32`, and then to `(numpy.)float64` (exactly so, even if the input value already happened to be of `(numpy.)float64` data type).
+
+PMML does not provide effective means for implementing "chained casts"; the chain must be broken down into elementary cast operations, each of which is represented using a standalone `DerivedField` element.
+For example, preparing the "Sepal.Length" field of the iris dataset:
+
+``` xml
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+```
+
+Activating the `input_float` conversion option:
+
+``` python
+pipeline = PMMLPipeline([
+	("classifier", DecisionTreeClassifier())
+])
+pipeline.fit(iris_X, iris_y)
+
+# Default mode
+pipeline.configure(input_float = False)
+sklearn2pmml(pipeline, "DecisionTree-default.pmml")
+
+# "Input float" mode
+pipeline.configure(input_float = True)
+sklearn2pmml(pipeline, "DecisionTree-input_float.pmml")
+```
+
+This conversion option updates the data type of the "Sepal.Length" data field from `double` to `float`, thereby eliminating the need for the first `DerivedField` element of the two:
+
+``` xml
+
+
+
+
+
+
+
+
+
+
+
+
+```
+
+Changing the data type of a field may have side effects if the field contributes to more than one feature.
+The effectiveness and safety of configuration options should be verified by integration testing.
+
+* Added `H2OEstimator.pmml_classes_` attribute.
+
+This attribute allows customizing target category levels.
+It comes in handy when working with ordinal targets, where the H2O.ai framework requires that target category levels are encoded from their original representation to integer index representation.
+
+A fitted H2O.ai ordinal classifier predicts integer indices, which must be manually decoded in the application layer.
+The JPMML-SkLearn library is able to "erase" this encode-decode helper step from the workflow, resulting in a clean and efficient PMML document:
+
+``` python
+ordinal_classifier = H2OGeneralizedLinearEstimator(family = "ordinal")
+ordinal_classifier.fit(...)
+
+# Customize target category levels
+# Note that the default lexicographic ordering of labels is different from their intended ordering
+ordinal_classifier.pmml_classes_ = ["bad", "poor", "fair", "good", "excellent"]
+
+sklearn2pmml(ordinal_classifier, "OrdinalClassifier.pmml")
+```
+
+## Minor improvements and fixes
+
+* Fixed the categorical encoding of missing values.
+
+This bug manifested itself when the input column mixed values of different data types.
+For example, a sparse string column, where non-missing values are strings, and missing values are floating-point `numpy.NaN` values.
+
+Scikit-Learn documentation warns against mixing string and numeric values within a single column, but it can happen inadvertently when reading a sparse dataset into a Pandas DataFrame using standard library functions (e.g. the [`pandas.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function).
+
+* Added Pandas to package dependencies.
+
+See [SkLearn2PMML-418](https://github.com/jpmml/sklearn2pmml/issues/418)
+
+* Ensured compatibility with H2O.ai 3.46.0.1.
+
+* Ensured compatibility with BorutaPy 0.3.post0 (92e4b4e).
+
+
 # 0.105.0 #
 
 ## Breaking changes
diff --git a/README.md b/README.md
index 7400513..514a5c5 100644
--- a/README.md
+++ b/README.md
@@ -9,13 +9,13 @@ This package is a thin Python wrapper around the [JPMML-SkLearn](https://github.
 
 # News and Updates #
 
-The current version is **0.105.0** (21 March, 2024):
+The current version is **0.105.1** (29 March, 2024):
 
 ```
-pip install sklearn2pmml==0.105.0
+pip install sklearn2pmml==0.105.1
 ```
 
-See the [NEWS.md](https://github.com/jpmml/sklearn2pmml/blob/master/NEWS.md#01050) file.
+See the [NEWS.md](https://github.com/jpmml/sklearn2pmml/blob/master/NEWS.md#01051) file.
 
 # Prerequisites #