Commit

[YDF] Prepare release of PYDF 0.8.0
PiperOrigin-RevId: 677766701
rstz authored and copybara-github committed Sep 23, 2024
1 parent 967aab7 commit a89064f
Showing 12 changed files with 45 additions and 69 deletions.
23 changes: 1 addition & 22 deletions README.md
@@ -9,7 +9,7 @@
[![PyPI Downloads](https://img.shields.io/pypi/dm/ydf?style=flat-square)](https://pepy.tech/project/ydf)

**YDF** (Yggdrasil Decision Forests) is a library to train, evaluate, interpret,
and serve Random Forest, Gradient Boosted Decision Trees, and CART decision
and serve Random Forest, Gradient Boosted Decision Trees, CART and Isolation
forest models.

See the [documentation](https://ydf.readthedocs.org/) for more information on
@@ -84,27 +84,6 @@ SaveModel("my_model", model.get());
(based on [examples/beginner.cc](examples/beginner.cc))
The same model can be trained in Python using TensorFlow Decision Forests as
follows:
```python
import tensorflow_decision_forests as tfdf
import pandas as pd
# Load dataset in a Pandas dataframe.
train_df = pd.read_csv("project/train.csv")
# Convert dataset into a TensorFlow dataset.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="my_label")
# Train model
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)
# Export model.
model.save("project/model")
```

## Next steps
Check the
21 changes: 11 additions & 10 deletions documentation/public/docs/hyperparameters.md
@@ -137,8 +137,8 @@ reasonable time.

- **Type:** Integer **Default:** 5 **Possible values:** min:1

- Truncation of the cross-entropy NDCG loss. Only used with cross-entropy NDCG
loss i.e. `loss="XE_NDCG_MART"`
- Truncation of the cross-entropy NDCG loss (default 5). Only used with
cross-entropy NDCG loss i.e. `loss="XE_NDCG_MART"`

#### [dart_dropout](https://github.com/google/yggdrasil-decision-forests/blob/main/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.proto)

@@ -172,18 +172,19 @@ reasonable time.

- **Type:** Real **Default:** 0.5 **Possible values:** min:0 max:1

- EXPERIMENTAL. Weighting parameter for focal loss, positive samples weighted
by alpha, negative samples by (1-alpha). The default 0.5 value means no
active class-level weighting. Only used with focal loss i.e.
- EXPERIMENTAL, default 0.5. Weighting parameter for focal loss, positive
samples weighted by alpha, negative samples by (1-alpha). The default 0.5
value means no active class-level weighting. Only used with focal loss i.e.
`loss="BINARY_FOCAL_LOSS"`

#### [focal_loss_gamma](https://github.com/google/yggdrasil-decision-forests/blob/main/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.proto)

- **Type:** Real **Default:** 2 **Possible values:** min:0

- EXPERIMENTAL. Exponent of the misprediction exponent term in focal loss,
corresponds to gamma parameter in https://arxiv.org/pdf/1708.02002.pdf. Only
used with focal loss i.e. `loss="BINARY_FOCAL_LOSS"`
- EXPERIMENTAL, default 2.0. Exponent of the misprediction exponent term in
focal loss, corresponds to gamma parameter in
https://arxiv.org/pdf/1708.02002.pdf. Only used with focal loss i.e.
`loss="BINARY_FOCAL_LOSS"`
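
To make the roles of `focal_loss_alpha` and `focal_loss_gamma` concrete, here is a minimal sketch of the per-example binary focal loss as defined in the paper linked above. The function name `binary_focal_loss` and the clamping constant are illustrative assumptions; this is not YDF's implementation, which may differ in details.

```python
import math


def binary_focal_loss(label: int, prob: float, alpha: float = 0.5,
                      gamma: float = 2.0) -> float:
    """Focal loss for one example (hypothetical helper, not a YDF API).

    label: 1 for a positive example, 0 for a negative one.
    prob: predicted probability of the positive class.
    """
    # Probability assigned to the true class.
    p_t = prob if label == 1 else 1.0 - prob
    # Positive examples are weighted by alpha, negative ones by (1 - alpha).
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # Clamp to avoid log(0); the constant is an arbitrary choice for this sketch.
    p_t = max(p_t, 1e-12)
    # (1 - p_t) ** gamma down-weights examples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(label=1, prob=0.9)  # confident and correct
hard = binary_focal_loss(label=1, prob=0.1)  # confident and wrong
print(f"easy={easy:.6f} hard={hard:.6f}")
```

With `gamma=0` the modulating term disappears and the loss reduces to cross-entropy scaled by the class weight, which is why `alpha=0.5` amounts to no active class-level weighting.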

#### [forest_extraction](https://github.com/google/yggdrasil-decision-forests/blob/main/yggdrasil_decision_forests/learner/gradient_boosted_trees/gradient_boosted_trees.proto)

@@ -365,8 +366,8 @@ reasonable time.

- **Type:** Integer **Default:** 5 **Possible values:** min:1

- Truncation of the NDCG loss. Only used with NDCG loss i.e.
`loss="LAMBDA_MART_NDCG"`
- Truncation of the NDCG loss (default 5). Only used with NDCG loss i.e.
`loss="LAMBDA_MART_NDCG".`
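
As an illustration of what truncation means here, the sketch below computes the NDCG metric truncated at `k` (default 5) in plain Python. It shows the metric only, not the LambdaMART loss built on top of it. The exponential gain `2^rel - 1` and the helper names are assumptions made for this example, not taken from YDF's code.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k items only."""
    return sum(
        (2.0 ** rel - 1.0) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances_in_predicted_order: Sequence[float], k: int = 5) -> float:
    """NDCG truncated at k; items ranked below position k contribute nothing."""
    ideal_order = sorted(relevances_in_predicted_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal_order, k)
    if ideal_dcg == 0.0:
        return 0.0  # No relevant item in the group.
    return dcg_at_k(relevances_in_predicted_order, k) / ideal_dcg


# The only relevant item is ranked 6th, so it is invisible at truncation 5.
print(ndcg_at_k([0, 0, 0, 0, 0, 3], k=5))
print(ndcg_at_k([0, 0, 0, 0, 0, 3], k=6))
```

A smaller truncation focuses the objective on the very top of the ranking, while a larger one also rewards correct ordering further down the list.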

#### [num_candidate_attributes](https://github.com/google/yggdrasil-decision-forests/blob/main/yggdrasil_decision_forests/learner/decision_tree/decision_tree.proto)

11 changes: 7 additions & 4 deletions yggdrasil_decision_forests/port/python/CHANGELOG.md
@@ -1,6 +1,6 @@
# Changelog

## HEAD
## 0.8.0 - 2024-09-23

### Breaking

@@ -22,7 +22,7 @@
- Add `num_examples_per_tree()` method to Isolation Forest models.
- Expose the slow engine for debugging predictions and evaluations with
`use_slow_engine=True`.
- Speed-up training of GBT models by ~10%
- Speed-up training of GBT models by ~10%.
- Support for categorical and boolean features in Isolation Forests.
- Add `ydf.util.read_tf_record` and `ydf.util.write_tf_record` to facilitate
TF Record datasets usage.
@@ -36,14 +36,17 @@
- Add argument to control the maximum duration of `model.analyze`.
- Add support for Unicode strings, normalize categorical set values in the
same way as categorical values, and validate their types.
- Native support for PyGrain DataLoader and Dataset for all operations (e.g.,
training, evaluation, predictions).
- Add support for distributed training for ranking gradient boosted tree
models.

### Fix

- Fix labels of regression evaluation plots
- Improved errors if Isolation Forest training fails.

### Release music

Perpetuum Mobile "Ein musikalischer Scherz", Op. 257. Johann Strauss (Sohn)

## 0.7.0 - 2024-08-21

18 changes: 4 additions & 14 deletions yggdrasil_decision_forests/port/python/README.md
@@ -5,10 +5,10 @@ Decision Forests. It allows direct, fast access to YDF's methods and it also
offers advanced import / export, evaluation and inspection methods. While the
package is called YDF, the wrapping code is sometimes lovingly called *PYDF*.

It is not a replacement for its sister project
YDF is the successor of
[Tensorflow Decision Forests](https://github.com/tensorflow/decision-forests)
(TF-DF). Instead, it complements TF-DF for use cases that cannot be solved
through the Keras API.
(TF-DF). TF-DF is still maintained, but new projects should choose YDF for
improved performance, better model quality and more features.

## Installation

@@ -41,15 +41,5 @@ loaded_model = ydf.load_model("my_model")

## Frequently Asked Questions

* **Is it PYDF or YDF?** The name of the library is simply ydf, and so is the
name of the corresponding Pip package. Internally, the team sometimes uses
the name *PYDF* because it fits so well.
* **What is the status of PYDF?** PYDF is currently in Alpha development. Most
parts already work well (training, evaluation, predicting, export), some new
features are yet to come. The API surface is mostly stable but may still
change without notice.
* **Where is the documentation for PYDF?** The documentation is
available on https://ydf.readthedocs.org.
* **How should I pronounce PYDF?** The preferred pronunciation is
"Py-dee-eff" / ˈpaɪˈdiˈɛf (IPA)
See the [FAQ](https://ydf.readthedocs.io/en/latest/faq/) in the documentation.

2 changes: 1 addition & 1 deletion yggdrasil_decision_forests/port/python/config/setup.py
@@ -22,7 +22,7 @@
from setuptools.command.install import install
from setuptools.dist import Distribution

_VERSION = "0.7.0"
_VERSION = "0.8.0"

with open("README.md", "r", encoding="utf-8") as fh:
long_description = fh.read()
3 changes: 1 addition & 2 deletions yggdrasil_decision_forests/port/python/dev_requirements.txt
@@ -13,5 +13,4 @@ jax; platform_machine != 'aarch64' and platform_system != 'Windows'
jaxlib; platform_machine != 'aarch64' and platform_system != 'Windows'
optax; platform_machine != 'aarch64' and platform_system != 'Windows' and python_version >= '3.9'
flatbuffers; platform_machine != 'aarch64' and platform_system != 'Windows' and python_version >= '3.12'
tensorflow-datasets; platform_machine != 'aarch64' and platform_system != 'Windows' and python_version >= '3.9'
grain
tensorflow-datasets; platform_machine != 'aarch64' and platform_system != 'Windows' and python_version >= '3.9'
@@ -34,7 +34,7 @@
cls
setlocal

set YDF_VERSION=0.7.0
set YDF_VERSION=0.8.0
set BAZEL=bazel.exe
set BAZEL_SH=C:\msys64\usr\bin\bash.exe
set BAZEL_FLAGS=--config=windows_cpp20 --config=windows_avx2
5 changes: 3 additions & 2 deletions yggdrasil_decision_forests/port/python/ydf/dataset/dataset.py
@@ -372,8 +372,9 @@ def create_vertical_dataset(
Args:
data: Source dataset. Supported formats: VerticalDataset, (typed) path, list
of (typed) paths, Pandas DataFrame, Xarray Dataset, TensorFlow Dataset,
PyGrain DataLoader and Dataset, dictionary of string to NumPy array or
lists. If the data is already a VerticalDataset, it is returned unchanged.
PyGrain DataLoader and Dataset (experimental, Linux only), dictionary of
string to NumPy array or lists. If the data is already a VerticalDataset,
it is returned unchanged.
columns: If None, all columns are imported. The semantic of the columns is
determined automatically. Otherwise, if include_all_columns=False
(default) only the column listed in `columns` are imported. If
2 changes: 2 additions & 0 deletions yggdrasil_decision_forests/port/python/ydf/dataset/io/BUILD
@@ -98,6 +98,8 @@ py_test(
py_test(
name = "pygrain_io_test",
srcs = ["pygrain_io_test.py"],
# TODO: Figure out what to do with Pygrain support, since it does not work on MacOS.
tags = ["manual"], # Grain is not supported on MacOS
deps = [
":dataset_io_types",
":pygrain_io",
@@ -85,9 +85,9 @@
3. A Xarray dataset.
4. A YDF VerticalDataset created with `ydf.create_vertical_dataset`. This option is the most efficient when the same dataset is used multiple times.
5. A batched TensorFlow Dataset.
6. A PyGrain DataLoader or Dataset.
7. A typed path to a csv file e.g. "csv:/tmp/dataset.csv". See supported types below. The path can be sharded (e.g. "csv:/tmp/dataset@10") or globbed ("csv:/tmp/dataset*").
8. A list of typed paths e.g. ["csv:/tmp/data1.csv", "csv:/tmp/data2.csv"]. See supported types below.
6. A typed path to a csv file e.g. "csv:/tmp/dataset.csv". See supported types below. The path can be sharded (e.g. "csv:/tmp/dataset@10") or globbed ("csv:/tmp/dataset*").
7. A list of typed paths e.g. ["csv:/tmp/data1.csv", "csv:/tmp/data2.csv"]. See supported types below.
8. A PyGrain DataLoader or Dataset (experimental, Linux only).
The supported file formats and corresponding prefixes are:
- CSV file. prefix 'csv:'
19 changes: 10 additions & 9 deletions yggdrasil_decision_forests/port/python/ydf/model/generic_model.py
@@ -418,8 +418,9 @@ def predict(
Args:
data: Dataset. Supported formats: VerticalDataset, (typed) path, list of
(typed) paths, Pandas DataFrame, Xarray Dataset, TensorFlow Dataset,
PyGrain DataLoader and Dataset, dictionary of string to NumPy array or
lists. If the dataset contains the label column, that column is ignored.
PyGrain DataLoader and Dataset (experimental, Linux only), dictionary of
string to NumPy array or lists. If the dataset contains the label
column, that column is ignored.
use_slow_engine: If true, uses the slow engine for making predictions. The
slow engine of YDF is an order of magnitude slower than the other
prediction engines. There exist very rare edge cases where predictions
@@ -506,8 +507,8 @@ def evaluate(
Args:
data: Dataset. Supported formats: VerticalDataset, (typed) path, list of
(typed) paths, Pandas DataFrame, Xarray Dataset, TensorFlow Dataset,
PyGrain DataLoader and Dataset, dictionary of string to NumPy array or
lists.
PyGrain DataLoader and Dataset (experimental, Linux only), dictionary of
string to NumPy array or lists.
weighted: If true, the evaluation is weighted according to the training
weights. If false, the evaluation is non-weighted. b/351279797: Change
default to weights=True.
@@ -655,8 +656,8 @@ def analyze_prediction(
Args:
single_example: Example to explain. Supported formats: VerticalDataset,
(typed) path, list of (typed) paths, Pandas DataFrame, Xarray Dataset,
TensorFlow Dataset, PyGrain DataLoader and Dataset, dictionary of string
to NumPy array or lists.
TensorFlow Dataset, PyGrain DataLoader and Dataset (experimental, Linux
only), dictionary of string to NumPy array or lists.
Returns:
Prediction explanation.
@@ -714,8 +715,8 @@ def analyze(
Args:
data: Dataset. Supported formats: VerticalDataset, (typed) path, list of
(typed) paths, Pandas DataFrame, Xarray Dataset, TensorFlow Dataset,
PyGrain DataLoader and Dataset, dictionary of string to NumPy array or
lists.
PyGrain DataLoader and Dataset (experimental, Linux only), dictionary of
string to NumPy array or lists.
sampling: Ratio of examples to use for the analysis. The analysis can be
expensive to compute. On large datasets, use a small sampling value e.g.
0.01.
@@ -1463,7 +1464,7 @@ def _build_evaluation_dataspec(
effective_dataspec = self._model.data_spec()

def find_existing_or_add_column(
semantic: Optional[data_spec_pb2.ColumnType],
semantic: Optional[Any],
name: Optional[str],
default_col_idx: int,
usage: str,
2 changes: 1 addition & 1 deletion yggdrasil_decision_forests/port/python/ydf/version.py
@@ -12,4 +12,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.

version = "0.7.0"
version = "0.8.0"
