diff --git a/documentation/public/docs/guide_feature_semantics.md b/documentation/public/docs/guide_feature_semantics.md
new file mode 100644
index 00000000..de2ea738
--- /dev/null
+++ b/documentation/public/docs/guide_feature_semantics.md
@@ -0,0 +1,536 @@
+# Feature Semantics
+
+When training a model, YDF needs to understand how to interpret the feature
+values in the training data. A feature might, for example, be a numerical
+quantity, a category, or a set of tags. The interpretation of a feature’s
+values is called the **feature semantic**.
+
+The semantic of a feature is related to, but different from, the feature’s
+*representation*, that is, the (technical) data type of the feature. For
+example, a feature represented by 64-bit integers might have a numerical
+semantic or a categorical semantic.
+
+In simple cases, YDF detects the semantics of a feature automatically, so you
+only need to check them after the training (e.g. use [model.describe()](/py_api/GenericModel/#ydf.GenericModel.describe)). If YDF
+does not detect the correct semantic, you can manually override it. Using the
+wrong semantic negatively impacts the training speed and the quality of a
+model. Also, YDF cannot consume all types of features natively. In such cases,
+features need to be preprocessed into a supported semantic.
+
+This guide explains the different semantics available in YDF, how to check and
+select them, and gives recommendations on how to feed different types of
+features into the model.
+
+This guide assumes basic familiarity with YDF, e.g. the
+[Getting Started](/tutorial/getting_started)
+tutorial.
+
+## Introduction: How to specify feature semantics
+
+Unless it is given additional information, YDF automatically determines the
+feature semantics when training the model:
+
+```python
+model = ydf.RandomForestLearner(label="label").train(ds)
+# The "Dataspec" tab of the model description shows the feature semantics
+# used for this model.
+model.describe()
+```
+
+It is possible to override the feature semantics manually using the
+`ydf.Semantic` enum:
+
+```python
+model = ydf.RandomForestLearner(
+    features=[("f1", ydf.Semantic.NUMERICAL), ("f2", ydf.Semantic.CATEGORICAL)],
+    include_all_columns=True,  # Also use features that are not explicitly defined.
+    label="label"
+).train(ds)
+model.describe()
+```
+
+Currently, YDF supports 5 input feature semantics. New semantics are added from
+time to time following our research. The semantics are:
+
+* `ydf.Semantic.NUMERICAL`
+* `ydf.Semantic.CATEGORICAL`
+* `ydf.Semantic.BOOLEAN`
+* `ydf.Semantic.CATEGORICAL_SET`
+* `ydf.Semantic.DISCRETIZED_NUMERICAL`
+
+The next sections explain the individual semantics in more detail.
+
+## Feature semantics
+
+### ydf.Semantic.NUMERICAL
+
+NUMERICAL features represent quantities,
+amounts, or more generally, any ordered values. For example, age (in years),
+duration (in seconds), net worth (in dollars), number of requests (a
+count), scores (in points), and even the median of a distribution, are
+NUMERICAL features.
+
+YDF automatically recognizes integer and floating-point values as NUMERICAL.
+
+```python
+# A dataset with 4 numerical features.
+dataset = {
+    "age": np.array([1, 55, 24, 8]),
+    "number of cats": np.array([0, 10, 4, 2]),
+    "net worth": np.array([0.0, 123456.78, -4000.0, 315.42]),
+    "score": np.array([1.1, 5.2, math.nan, 1.2]),
+}
+```
+
+### ydf.Semantic.CATEGORICAL
+
+CATEGORICAL features represent categories,
+enum values, tags, or more generally, any unordered values. For example,
+*species* (among cat, bird, or fish), *blood type* (among A, B, AB, O),
+*country*, *language*, and *project status* (in planning, in progress, done,
+canceled). YDF automatically recognizes string features as CATEGORICAL.
+
+Additional considerations for categorical features are:
+
+* **Don’t bucket**: With neural networks, numerical features are sometimes
+    bucketed into categorical brackets (e.g., 0-5, 5-10, 10-20).
This is not
+    beneficial for YDF. If you can, feed the value directly as NUMERICAL. If
+    you only have the bucketed data, still feed it as a numerical feature.
+* **Don’t use one-hot-encoding:** One-hot encoding for categorical
+    features consistently underperforms with tree algorithms and
+    should not be used in YDF.
+* **Preprocessing:** To avoid overfitting, YDF automatically replaces rare
+    categorical values with OOD (“out of dictionary”). This generally leads
+    to better models. This behavior is controlled by the hyperparameters
+    `max_vocab_count` (to limit the number of categories) and
+    `min_vocab_frequency` (to prune rare categories).
+* **Unknown Values:** During inference, any unknown categorical value is
+    treated as OOD, which is distinct from a missing value. For example,
+    consider a model with a feature “species” that has values “cat”, “bird”
+    and “fish” in the training dataset. If, during model inference, an
+    instance has value “tiger” for the “species” feature, the model will
+    implicitly transform “tiger” to the OOD token.
+* **Python Enums**: Numerical Python enums should generally have a
+    categorical semantic but, since they are integers, they are automatically
+    recognized as NUMERICAL. Enums should therefore be specified manually as
+    CATEGORICAL.
+
+    ```python
+    # A dataset with 4 categorical features.
+    dataset = {
+        "species": np.array(["cat", "bird", "bird", "fish", "fish"]),
+        "country": np.array(["US", "US", "Switzerland", "India", "India"]),
+        "month": np.array([1, 1, 4, 6, 1]),  # CATEGORICAL features can be integers.
+        "blood type": np.array(["A", "B", "AB", "B", ""]),  # Missing values are empty.
+    }
+    model = ydf.RandomForestLearner(
+        label="species",
+        # Since integers are auto-detected as NUMERICAL, specify the semantic manually.
+        features=[("month", ydf.Semantic.CATEGORICAL)],
+        min_vocab_frequency=2,  # Prune vocabulary items that appear only once.
+    ).train(dataset)
+    # Check that all features and the label are CATEGORICAL.
+    # Check model.data_spec() to see each feature's categories.
+    model.describe()
+    ```
+
+### ydf.Semantic.BOOLEAN
+
+The value of a Boolean feature can only be true, false, or
+missing. Examples are “has subscribed”, “is spam”, “is in stock”, etc.
+Boolean features are a special case of categorical features. YDF
+automatically recognizes boolean features as BOOLEAN for most dataset
+formats.
+
+The label of the model can never be BOOLEAN. Note that binary classification
+uses CATEGORICAL labels.
+
+```python
+# A dataset with 3 boolean features.
+dataset = {
+    "has subscribed": np.array([True, False, False, True]),
+    "spam": np.array([1, 0, 1, 1], dtype=bool),  # Ensure the dtype is not integer.
+    "happy": np.array([True, False, False, True]),
+}
+```
+
+!!! warning
+
+    Avoid IDs as features: Many datasets have features of type
+    "ID" / "identifier", "unique_hash", etc. that are (nearly) unique for each
+    example in the dataset. These features are not useful for training a machine
+    learning model; they slow down model training and might increase model size.
+    It is therefore important to remove these features from the dataset before
+    training.
+
+## Special semantics
+
+### ydf.Semantic.DISCRETIZED_NUMERICAL
+
+DISCRETIZED_NUMERICAL is not really a new semantic. Instead, it is used to
+tell the learning algorithm to optimize training with a special
+discretization algorithm. Any NUMERICAL feature can be configured as
+DISCRETIZED_NUMERICAL. Training will be faster (generally ~2x), but it can
+hurt the model quality. Setting all the NUMERICAL features as
+DISCRETIZED_NUMERICAL is equivalent to setting the hyperparameter
+`discretize_numerical_columns=True`.
+
+```python
+data = {
+    "age": np.array([1, 55, 24, 8]),
+    "net worth": np.array([0.0, 123456.78, -4000.0, 315.42]),
+    "weight": np.array([9, 63, 70, np.nan]),
+}
+model = ydf.RandomForestLearner(
+    label="weight",
+    discretize_numerical_columns=False,  # Default
+    features=[("net worth", ydf.Semantic.DISCRETIZED_NUMERICAL)],
+    task=ydf.Task.REGRESSION,
+).train(data)
+# `net worth` is DISCRETIZED_NUMERICAL, `age` and `weight` are NUMERICAL.
+model.describe()
+```
+
+### ydf.Semantic.CATEGORICAL_SET
+
+The value of a categorical-set feature is
+a set of categorical values. In other words, while a
+`ydf.Semantic.CATEGORICAL` feature can only have one value, a
+`ydf.Semantic.CATEGORICAL_SET` feature can have none, one, or many categorical
+values. Use this for sets of discrete values, such as tokenized text or tag
+sets (e.g. a webpage talks about {politics, elections, united-states}).
+
+For text features, the tokenization is important to consider. Splitting on
+spaces works okay in English, but poorly in Chinese. For example, "A cat
+sits on a tree" becomes {a, cat, on, sits, tree}. Using a more powerful
+tokenizer might be better. Since a set does not encode position, {a, cat,
+on, sits, tree} is equivalent to {a, tree, sits, on, cat}. One solution is
+to use multi-grams (e.g., bi-grams) that encode consecutive words. For
+example, the bi-grams in our example are {a_cat, cat_sits, sits_on, on_a,
+a_tree}.
+
+* **Preprocessing:** As with CATEGORICAL features, YDF automatically
+    replaces rare categorical values with OOD (“out of dictionary”) for
+    CATEGORICAL_SET. This behavior is controlled by the hyperparameters
+    `max_vocab_count` (to limit the number of categories) and
+    `min_vocab_frequency` (to prune rare categories).
+* **Training speed**: CATEGORICAL_SET features are slower to train than
+    NUMERICAL or CATEGORICAL features. Don't use CATEGORICAL_SET instead of
+    CATEGORICAL features (the result will be the same, but the model will
+    train more slowly).
+* **Tokenization**: When using CATEGORICAL_SET for text features, the text
+    must be tokenized before it is fed to YDF. CSV files are tokenized by
+    whitespace automatically; see the section on CSV files for details.
+
+```python
+# A dataset with 2 categorical-set features and one categorical feature.
+dataset = {
+    "title": [["Next", "week", "are", "us", "elections"], ["Reform", "started", "this", "month"], ["Funniest", "politics", "speeches"]],
+    "tags": [["politics", "election"], ["politics"], ["funny", "politics"]],
+    "interesting": ["yes", "yes", "no"],
+}
+model = ydf.RandomForestLearner(
+    label="interesting",
+    min_vocab_frequency=1,  # Don't prune the categories.
+    features=[("title", ydf.Semantic.CATEGORICAL_SET), ("tags", ydf.Semantic.CATEGORICAL_SET)],
+).train(dataset)
+```
+
+### Multi-dimensional features
+
+YDF supports constant-size vectors as
+features. This is typically used when dealing with vector embeddings. For
+example, consider a text feature `text` that is transformed (during
+preprocessing) with a Universal Sentence Encoder model to a numerical vector
+of 512 entries. This vector can be fed directly to YDF.
+
+YDF “unrolls” each entry of the vector to an individual feature. These
+features are named `text.0_of_512`, `text.1_of_512`, etc. Note that all
+vectors must have the exact same size; if the vectors have different sizes,
+consider the CATEGORICAL_SET semantic. See
+[here](/tutorial/multidimensional_feature)
+for a more detailed example.
+
+```python
+# A dataset with a two-dimensional numerical feature
+# and a two-dimensional categorical feature.
+dataset = {
+    "categorical_vector": np.array([["a", "b"], ["a", "c"], ["b", "c"]]),
+    "numerical_embedding": np.array([[1, 2], [3, 4], [5, 6]]),
+    "label": np.array([1, 2, 1]),
+}
+model = ydf.RandomForestLearner(
+    label="label",
+).train(dataset)
+# `categorical_vector` is unrolled to two CATEGORICAL features:
+# `categorical_vector.0_of_2` and `categorical_vector.1_of_2`.
+# `numerical_embedding` is unrolled to two NUMERICAL features:
+# `numerical_embedding.0_of_2` and `numerical_embedding.1_of_2`.
+model.describe()
+```
+
+!!! note
+
+    ydf.Semantic.HASH is used internally only and cannot be used for decision
+    tree training.
+
+
+## Not natively supported semantics
+
+There are some features that YDF cannot consume natively; they instead need to
+be preprocessed. This section details the most common scenarios.
+
+### Timestamps
+
+Timestamps should be converted to the NUMERICAL semantic. A popular choice is to
+decompose the timestamp into calendar features such as day of the week, week of
+the month, hour of the day, etc. Simply converting a timestamp into a
+numerical unix time generally does not work well.
+
+!!! warning
+
+    The presence of timestamps often indicates that the dataset contains
+    time-series data (see next section).
+
+
+### Time-series
+
+Time series datasets require advanced feature preprocessing for good model
+quality. Check the
+[special guide](/tutorial/time_sequences)
+in the YDF documentation for more information.
+
+!!! warning
+
+    Time series require careful modeling to prevent future leakage or poor model
+    quality. For complex problems, using a preprocessing tool such as
+    [Temporian](https://temporian.readthedocs.io) is recommended.
+
+
+### Repeated proto messages
+
+Repeated proto messages must be “flattened” into individual features. For
+repeated numerical entries, creating statistical features such as maximum,
+minimum, mean, median, or variance is useful. Repeated categorical entries can
+be transformed into CATEGORICAL_SET features.
+
+### Images
+
+Decision Forest models are not a state-of-the-art model architecture for image
+processing. In many cases, exploring other model architectures (notably neural
+networks) should be preferred.
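The timestamp decomposition described in the Timestamps section above can be sketched as follows. This is an illustrative snippet, not a YDF API: the helper name `decompose_timestamps` and the particular choice of calendar features are assumptions, and the resulting dictionary can be fed to a learner like any other feature dictionary.

```python
import datetime

import numpy as np


def decompose_timestamps(unix_seconds):
    """Decomposes unix timestamps (seconds, UTC) into calendar features."""
    dts = [
        datetime.datetime.fromtimestamp(int(t), tz=datetime.timezone.utc)
        for t in unix_seconds
    ]
    return {
        # NUMERICAL features.
        "hour_of_day": np.array([d.hour for d in dts]),
        "day_of_month": np.array([d.day for d in dts]),
        # String features are auto-detected as CATEGORICAL.
        "day_of_week": np.array([d.strftime("%A") for d in dts]),
        "month": np.array([d.strftime("%B") for d in dts]),
    }


# 2024-01-01 and 2024-01-02, midnight UTC.
dataset = {"event_time": np.array([1704067200, 1704153600])}
# Replace the raw timestamps with calendar features before training.
dataset.update(decompose_timestamps(dataset.pop("event_time")))
```

After this preprocessing, `dataset` contains only features with supported semantics.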
+
+## Details by dataset format
+
+### Numpy
+
+**Scalar Types**
+
+The following table shows which **scalar Numpy data types** correspond to which
+YDF semantics and which casts can be performed by YDF after manually specifying
+a feature semantic.
+
+| Numpy\YDF | **NUMERICAL** [3] | **CATEGORICAL** | **BOOLEAN** | **CATEGORICAL SET** [6] | **DISCRETIZED NUMERICAL** |
+|---|---|---|---|---|---|
+| **int** [1] | Default | Cast [4] | No support | No support | Cast |
+| **float** [2] | Default | No support | No support | No support | Cast |
+| **bool** | Cast | Cast [5] | Default | No support | Cast |
+| **str** | No support | Default | No support | No support | No support |
+| **bytes** | No support | Default | No support | No support | No support |
+
+[1]: Includes unsigned integers
+
+[2]: float128 is not supported
+
+[3]: YDF internally casts numerical values to float32.
+
+[4]: Internally, the values are cast to string and sorted lexicographically.
+
+[5]: Internally, the values are cast to “false” and “true”, with “false” coming
+first.
+
+[6]: Use type `object` for CATEGORICAL SET support; see below.
+
+**Two-dimensional arrays (i.e. matrices)**
+
+YDF unrolls two-dimensional arrays (i.e. matrices) by column.
+
+**Object**
+
+YDF inspects features given as numpy arrays of type `object` and treats them
+differently based on their content.
+
+If the array’s first element is a **scalar type**, YDF attempts to cast the
+array to type `bytes` and treat it as CATEGORICAL.
+
+If the array’s first element is a **Python list or Numpy array** (irrespective
+of dtype), YDF checks if it contains only lists (arrays) and fails otherwise. If
+all lists have the same length, YDF attempts to cast the sub-lists (sub-arrays)
+to type `np.bytes_` and treat the entire feature as a matrix of CATEGORICAL
+features. This matrix is then unrolled into individual features.
+
+If the sub-lists (sub-arrays) have different sizes, YDF attempts to cast the
+sub-lists (sub-arrays) to type `np.bytes_` and treat the entire feature with
+semantic CATEGORICAL_SET.
+
+Any other types or combinations of types are not supported.
+
+**Missing values**
+
+Missing NUMERICAL values are `np.nan`; missing CATEGORICAL values are empty
+strings. Missing BOOLEAN or CATEGORICAL_SET values cannot be represented.
+
+### Python lists
+
+YDF can consume Python lists as features. However, automatic semantic detection
+is not enabled for Python lists. Furthermore, multi-dimensional features (except
+CATEGORICAL_SET) cannot be fed with Python lists.
+
+### CSV
+
+**Automatic Semantic detection**
+
+* Columns with only 0 and 1 are recognized as BOOLEAN.
+* Columns with only numeric values are recognized as NUMERICAL.
+* Other columns are recognized as CATEGORICAL (see below).
+* Multidimensional features are not supported.
+
+Example code:
+
+```python
+"""!cat mycsv.csv
+num,bool,cat,catset,label
+1.0,1,a,x y,1
+1.5,0,b,y z,2
+2.0,1,1,x y z,3"""
+model = ydf.RandomForestLearner(
+    label="label",
+    # Note that CATEGORICAL_SET columns are tokenized by whitespace.
+    features=[("catset", ydf.Semantic.CATEGORICAL_SET)],
+    min_vocab_frequency=1,
+    include_all_columns=True,
+).train("csv:mycsv.csv")
+# Column "num" is NUMERICAL.
+# Column "bool" is BOOLEAN.
+# Column "cat" is CATEGORICAL.
+# Column "catset" is CATEGORICAL_SET (as specified).
+model.describe()
+```
+
+!!! warning
+
+    Only the first 100000 rows are inspected to determine a column's type.
+    Increase `max_num_scanned_rows_to_infer_semantic` to inspect more rows.
+
+**Automatic tokenization**
+
+When reading from CSV files, YDF tokenizes features with semantic
+CATEGORICAL_SET by splitting along whitespace.
+
+Some YDF surfaces may automatically infer the type of string columns containing
+whitespace as having semantic CATEGORICAL_SET.
Note that the Python API of YDF
+does not automatically infer type CATEGORICAL_SET, even when reading CSV files.
+
+**Missing values**
+
+Missing values are represented by the string `na` or by empty strings.
+
+### Avro
+
+**Automatic Semantic detection**
+
+Avro columns are typed, so YDF uses a column's type by default.
+
+* Boolean columns are recognized as BOOLEAN.
+* Long, Int, Float and Double columns are recognized as NUMERICAL.
+* Bytes and String columns are recognized as CATEGORICAL.
+* Array columns are unrolled based on the type of data in the array.
+    Arrays of Bytes or Strings may furthermore have type CATEGORICAL_SET.
+* Nested arrays are currently not supported but may be supported in the
+    future.
+
+## Advanced: How decision forests use feature semantics
+
+This section is not required for using YDF, but it might offer additional
+context on the individual semantics and how they are used by common decision
+forest learning algorithms.
+
+Recall that decision trees recursively split the dataset until a stopping
+condition is reached. The feature semantics dictate both which types of splits
+are considered and how the algorithm finds the best split.
+
+* **NUMERICAL:** The learning algorithm creates splits in the data based on
+    thresholds (e.g., "age >= 30").
+
+    For even more powerful models,
+    [enable oblique splits](/guide_how_to_improve_model/#use-oblique-trees).
+    This allows YDF to learn splits that combine multiple numerical features
+    (e.g., "0.3 \* age + 0.7 \* income >= 50"). This is particularly helpful for
+    smaller datasets but requires more training time.
+
+* **CATEGORICAL**: The learning algorithm creates splits by grouping values
+    into sets (e.g., "country in {USA, Switzerland, Luxembourg}"). This is more
+    expressive than numerical splits but computationally more expensive.
+
+* **BOOLEAN:** This semantic provides a memory- and speed-optimized way to
+    handle boolean data compared to using the semantics NUMERICAL or CATEGORICAL.
+
+* **CATEGORICAL_SET:** The learning algorithm creates splits based on set
+    intersections (e.g., "`{a, cat, on, sits, tree}` intersects `{cat, dog,
+    bird}`"). This semantic is computationally expensive and can be slow to
+    train. For in-depth information about categorical sets, see
+    [Guillame-Bert et al., 2020](https://arxiv.org/abs/2009.09991).
+
+### Missing Values
+
+YDF effectively handles missing values, which are represented differently for
+each semantic. During training, YDF uses global imputation:
+
+* **NUMERICAL and DISCRETIZED_NUMERICAL:** Missing values are replaced with
+    the mean of the feature.
+* **CATEGORICAL and BOOLEAN:** Missing values are replaced with the most
+    frequent value.
+* **CATEGORICAL_SET:** Missing values are always routed to the negative branch
+    of a split.
+
+If the hyperparameter `allow_na_conditions` is enabled, the learning algorithm
+can also create splits of the form “feature is NA”. Note that this is usually
+not necessary: Global imputation replaces missing values with a “special” value,
+which is quickly learned by the algorithm.
+
+!!! note
+
+    Missing values are not allowed in the label column.
+
+### Label column semantic
+
+The semantic of the label column is determined by the model task:
+
+* **REGRESSION:** Requires NUMERICAL labels.
+* **CLASSIFICATION:** Requires CATEGORICAL labels.
+* **RANKING:** Requires NUMERICAL labels with integer values.
+* **CATEGORICAL_UPLIFT:** Requires CATEGORICAL labels.
+* **NUMERICAL_UPLIFT:** Requires NUMERICAL labels.
+
+Changing the task changes the model's loss function and output, resulting in a
+fundamentally different model. If the label column cannot be interpreted with
+the chosen task’s semantic, model training fails.
+
+### The data spec
+
+Internally, YDF models store information about the feature semantics in the
+"data spec". A model's data spec is a proto message that can be displayed with
+`model.data_spec()`.
The data spec also contains information about +the feature values seen during training. Some of this information may be used +during training and inference. In particular, the data spec stores the following +information: + +* For NUMERICAL and DISCRETIZED_NUMERICAL features, statistical information + and, if relevant, bucketization. +* For CATEGORICAL and CATEGORICAL_SET features, the values seen during + training and their frequency. +* For BOOLEAN values, the frequency of true and false values seen during + training. + +A summary of the data spec is shown in `model.describe()`. The raw data spec +is mainly useful for debugging issues with YDF. diff --git a/documentation/public/docs/style/extra.css b/documentation/public/docs/style/extra.css index 08f54733..9af9fbd8 100644 --- a/documentation/public/docs/style/extra.css +++ b/documentation/public/docs/style/extra.css @@ -169,7 +169,3 @@ h1#_1 { background-color: transparent; white-space: pre-wrap; } - -.md-sidebar--secondary { - display: none !important; -} \ No newline at end of file diff --git a/documentation/public/mkdocs.yml b/documentation/public/mkdocs.yml index 9b39b448..01a9ae07 100644 --- a/documentation/public/mkdocs.yml +++ b/documentation/public/mkdocs.yml @@ -47,6 +47,12 @@ nav: - 📖 Glossary: glossary.md - 🤸 For Googlers: http://go/ydf - ✒️ Blog: blog/index.md + - Guides: + - What are decision forests?: https://developers.google.com/machine-learning/decision-forests + - How to define model features: guide_feature_semantics.md + - How to improve a model?: guide_how_to_improve_model.md + - How to train a model faster?: guide_how_to_improve_learner.md + - Migrating from TF-DF: tutorial/migrating_to_ydf.ipynb - Tasks solved by YDF: - Classification: tutorial/classification.ipynb - Regression: tutorial/regression.ipynb @@ -87,11 +93,6 @@ nav: - Evaluation: - Train & test: tutorial/train_and_test.ipynb - Cross-validation: tutorial/cross_validation.ipynb - - Guides: - - What are decision forests?: 
https://developers.google.com/machine-learning/decision-forests - - How to improve a model?: guide_how_to_improve_model.md - - How to train a model faster?: guide_how_to_improve_learner.md - - Migrating from TF-DF: tutorial/migrating_to_ydf.ipynb - Advanced: - Inspecting trees: tutorial/inspecting_trees.ipynb - Editing trees: tutorial/editing_trees.ipynb @@ -130,6 +131,8 @@ markdown_extensions: - pymdownx.details - pymdownx.arithmatex: generic: true + - toc: + permalink: true extra_javascript: - js/mathjax.js diff --git a/documentation/public/readme.md b/documentation/public/readme.md index e29a2197..8bcd839e 100644 --- a/documentation/public/readme.md +++ b/documentation/public/readme.md @@ -7,5 +7,5 @@ python3 -m pip install -r third_party/yggdrasil_decision_forests/documentation/public/requirements.txt # Start a http server with the documentation -(cd third_party/yggdrasil_decision_forests && mkdocs serve -a localhost:8888 -f documentation/public/mkdocs.yml) +(cd third_party/yggdrasil_decision_forests && mkdocs serve -a localhost:8889 -f documentation/public/mkdocs.yml) ```