Data encoding
s2t2 committed Sep 17, 2024
1 parent 2910040 commit bbf8b0d
Showing 12 changed files with 201 additions and 14 deletions.
Binary file added docs/images/5-fold-cross-validation.png
Binary file added docs/images/bag-of-words.png
Binary file added docs/images/one-hot-encoding-diamond-color.png
Binary file added docs/images/ordinal-encoding-passenger-class.png
Binary file added docs/images/shuffled-train-test-split.png
Binary file added docs/images/shuffled-train-test-split.webp
Binary file not shown.
41 changes: 33 additions & 8 deletions docs/notes/predictive-modeling/ml-foundations/data-encoding.qmd
@@ -1,31 +1,56 @@
# Data Encoding


When preparing features (`x` values) for training machine learning models, the models need the data to be in numeric format, in order to perform math with the values.
When preparing features (`x` values) for training machine learning models, it's essential to convert the data into a numeric format. This is because most machine learning algorithms perform mathematical operations on the data, which require numerical inputs.

So if we have categorical or textual data, we will need to use a **data encoding** strategy to represent the data in a different way.

So if we have categorical or textual data, we need to use a **data encoding** strategy to represent the data in a different way, transforming them into numbers.

For categorical data, we'll use either an ordinal or one-hot encoding strategy, depending on whether there is a certain ordered relationship in the data or not. For time-series data, we can use time step encoding.
We'll choose an encoding strategy based on the nature of the data. For categorical data, we'll use either ordinal encoding if there's an inherent order among the categories, or one-hot encoding if no such order exists. For time-series data, we can apply time step encoding to represent the temporal sequence of observations.


## Ordinal Encoding for Categorical Data

If the data has an order about it, where one category means more or less than others, then we will convert the categories into a linear range of numbered values.
When the data has a natural order (i.e. where one category is "greater" or "less" than another), we use **ordinal encoding**. This involves converting the categories into a sequence of numbered values. For example, in a dataset containing ticket classes like first, second, and third, we can map these to integers (e.g. 1, 2, 3), maintaining the ordered relationship.


![Example of ordinal encoding (passenger ticket classes).](../../../images/ordinal-encoding-passenger-class.png)


With ordinal encoding, we start with a column of categories, and we wind up with a column of numbers.
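
As a minimal sketch of ordinal encoding, assuming a hypothetical `ticket_class` column (the column name, values, and mapping below are illustrative, not taken from a specific dataset), we might map each category to an integer with pandas:

```python
import pandas as pd

# hypothetical example data (illustrative only)
df = pd.DataFrame({"ticket_class": ["third", "first", "second", "third"]})

# explicit mapping that preserves the order of the categories
class_order = {"first": 1, "second": 2, "third": 3}
df["ticket_class_encoded"] = df["ticket_class"].map(class_order)

print(df)
```

Writing the mapping out by hand keeps the intended order explicit, rather than relying on alphabetical order.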

## One-hot Encoding for Categorical Data

When the categorical data has no inherent order, we use **one-hot encoding**. In one-hot encoding, each unique category is represented as a binary vector, where only one element is 1, and the rest are 0.

For example, if we have five color categories (blue, green, red, purple, yellow), one-hot encoding will transform a single column of colors into five columns of binary values (0 or 1), where 1 represents presence of a given color, and 0 represents absence of that color.

![Example of one-hot encoding for categorical data (diamond colors).](../../../images/one-hot-encoding-diamond-color.png)

With one-hot encoding, we start with a column of categories, and we wind up with as many columns as there were unique values in the original column. This can lead to a large number of features if there are many categories.
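
One possible way to produce these binary columns is the `get_dummies` function from `pandas`. The sketch below assumes a hypothetical `color` column with made-up values:

```python
import pandas as pd

# hypothetical example data (illustrative only)
df = pd.DataFrame({"color": ["blue", "green", "red", "blue"]})

# one new 0/1 column per unique category
encoded = pd.get_dummies(df["color"], dtype=int)

print(encoded)
```

Each row of the result contains exactly one 1, in the column matching that row's original category.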


If the data is truly categorical, where there is no ordinal relationship present, we will perform "one-hot" encoding.
## One-hot Encoding for Textual Data

In natural language processing, we can split a sentence into words or tokens, and then note the presence or absence of a word in that sentence, using a one-hot encoding or related approach. Within the context of natural language processing, this is called a "bag of words" vectorization approach.


In natural language processing (NLP), we can apply one-hot encoding to represent the words or tokens in a sentence. This is typically referred to as the bag of words approach, where each document is represented by a binary vector with one element per unique word in the vocabulary, denoting that word's presence or absence.


![Example of one-hot encoding for textual data.](../../../images/bag-of-words.png)


The bag of words approach is a simple count-based text embedding approach. More advanced alternatives include Term Frequency-Inverse Document Frequency (TF-IDF) and word embeddings.
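
To make the binary bag of words idea concrete, here is a small sketch in plain Python. The example sentences are made up, and splitting on whitespace is a simplifying assumption (real tokenizers also handle punctuation and casing):

```python
# hypothetical example sentences (illustrative only)
sentences = ["the cat sat", "the dog sat down"]

# vocabulary: sorted list of unique words across all sentences
vocab = sorted({word for s in sentences for word in s.split()})

# one row per sentence, one 0/1 column per vocabulary word
vectors = [[1 if word in s.split() else 0 for word in vocab] for s in sentences]

print(vocab)
for row in vectors:
    print(row)
```

In practice, the `CountVectorizer` class from `sklearn` (with `binary=True`) can perform a similar transformation.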

## Time Step Encoding for Time-series Data

When dealing with time-series data, we need to convert the dates to numbers.
When working with time-series data, it's important to encode dates in a way that preserves the temporal structure. A common approach is to use time step encoding, which involves assigning sequential integer values to each timestamp.

For example, if our data is recorded daily, we can assign the earliest date the value 1, the next day 2, and so on. This works well when observations are recorded at uniform time intervals, such as daily, monthly, or annual frequencies.

We can take advantage of the linear nature of time, and model dates as integer time steps. For example, starting at one for the earliest data point, and incrementing by one with each subsequent data point. This assumes our observations are over uniform time intervals (i.e. daily, monthly, annual frequency, etc.).

To create an ordered list of time step integers, we sort our dataset by date in ascending order, putting the earliest date first. Then we add a column of integers incrementing from one to the length of the dataset:
To create an ordered list of time step integers, we sort our dataset by date in ascending order, putting the earliest date first. Then we add a column of sequential integers:

```python
df.sort_values(by="date", ascending=True, inplace=True)
10 changes: 7 additions & 3 deletions docs/notes/predictive-modeling/ml-foundations/generalization.qmd
@@ -69,7 +69,7 @@ Additional resources about generalization:
+ <https://developers.google.com/machine-learning/crash-course/overfitting/overfitting>


## Data Splitting Strategies
## Data Splitting

When building a machine learning model, it is important to evaluate its performance on data that the model has not seen during training. This ensures that the model is not overfitting and can generalize to new data.

@@ -112,7 +112,7 @@ The dataset is divided into several folds (commonly called **K-fold cross-valida
Cross validation is especially valuable when fine-tuning model hyperparameters, as it prevents overfitting to a specific validation set or the test set by providing a more generalized evaluation before the final test set assessment.


## Data Splitting Methods
## How to Split

This section provides some practical methods for splitting data in Python.

@@ -121,13 +121,17 @@ This section provides some practical methods for splitting data in Python.

In most machine learning problems, we typically perform a shuffled split, where the order of the data is randomized before partitioning it into training and testing sets. This helps ensure that the distribution in the training set closely resembles that of the test set, which reduces potential biases.

![Shuffled train/test split. Source: [Real Python](https://files.realpython.com/media/fig-1.c489adc748c8.png).](../../../images/shuffled-train-test-split.webp)


One common way of implementing a shuffled two-way split is to leverage the [`train_test_split` function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from `sklearn`:


```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=99)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
                                                    random_state=99)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
```
149 changes: 149 additions & 0 deletions docs/notes/predictive-modeling/ml-foundations/index.qmd
@@ -2,3 +2,152 @@


It's about predicting something: x/y (features and target), supervised vs unsupervised (ground truth labels / test set or not), regression vs classification.






## Predictive Modeling Process

Define the Problem:

+ What kind of task is this (i.e. regression vs classification)?
+ What is the target output variable (`y`) we want to predict?
+ What are the input features (`x`) we can use to make the prediction?
+ What kind of model(s) should we use to do the predicting?
+ What scoring metrics should we use?

Prepare the Data:

+ Checking for Nulls
+ Checking for Outliers
+ Examining Relationships
+ Data Scaling
+ Data Encoding
+ Data Splitting


Train and Evaluate the Model(s):
+ Train the model on the training dataset (x and y), so it knows the "right answers"
+ Evaluate the model on the testing dataset, which contains data it hasn't yet "seen"

Use Trained Model for Predictions and Forecasting












## Types of Machine Learning Tasks

**Supervised Learning**: when the data contains the "right answers" (a.k.a. "labels", or "target" prediction values). We share some of the right answers with the model, to help it learn what the desired output value is for a given set of inputs.

Supervised Tasks: Regression, Classification, etc.

**Unsupervised Learning**: when the data does not contain the "right answers" (i.e. lack of target prediction values). In these situations it is the model's responsibility to identify patterns in the given set of inputs.

Unsupervised Tasks: Clustering, Dimensionality Reduction, etc.

Reinforcement Learning:




## Supervised Learning Tasks

**Regression**: when the target variable we wish to predict is continuous - usually numeric.

Examples:

+ House Prices (in dollars)
+ Life Expectancy (in years)
+ Employee Salary (in dollars)
+ Distance to the Nearest Galaxy (in light years)

**Classification**: when the target variable we wish to predict is discrete - usually binary or categorical.

Examples:

+ Spam or Not (binary)
+ Success or Failure (binary)
+ Handwritten numeric digits (categorical)
+ 1-5 star rating scale (ordinal, often treated as categorical)

## Unsupervised Learning Tasks

**Dimensionality Reduction**: ___________________


## Model Selection

Regression Models:

+ Linear Regression
+ Ridge Regression
+ Lasso Regression
+ etc.

Classification Models:

+ Logistic Regression (yes, this is a classification, not a regression model)
+ Decision Tree
+ Random Forest
+ etc.

Dimensionality Reduction Models:

+ Principal Component Analysis (PCA)
+ t-SNE
+ UMAP
+ etc.

## Metric Selection

Regression Metrics:

+ R^2 Score
+ Mean Squared Error
+ Mean Absolute Error
+ Root Mean Squared Error
+ etc.

Classification Metrics:

+ Accuracy
+ Precision
+ Recall
+ F1 Score
+ ROC AUC
+ etc.
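
As a rough illustration of the classification metrics listed above, the sketch below uses functions from `sklearn.metrics` on made-up labels and predictions (the values are purely illustrative):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# hypothetical true labels and model predictions (illustrative only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)    # share of all predictions that are correct
prec = precision_score(y_true, y_pred)  # share of predicted positives that are correct
rec = recall_score(y_true, y_pred)      # share of actual positives that were found
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall

print("ACCURACY:", acc)
print("PRECISION:", prec)
print("RECALL:", rec)
print("F1:", f1)
```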


## Data Preprocessing

Checking for Nulls: When we explore the data, we should pay attention to whether or not there are missing or null values. We might need to either drop rows with null values, or "impute" (a.k.a. fill-in) the null values. For example, we might choose to fill in some missing values using the mean or median of all other values in that column.
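
A minimal sketch of both options, assuming a hypothetical `age` column with one missing value (the data is made up for illustration):

```python
import pandas as pd

# hypothetical example data with one missing value (illustrative only)
df = pd.DataFrame({"age": [22.0, None, 35.0, 41.0]})

# count null values in each column
print(df.isnull().sum())

# option 1: drop rows containing nulls
dropped = df.dropna()

# option 2: impute nulls using the column median
imputed = df.fillna({"age": df["age"].median()})
print(imputed)
```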

Checking for Outliers: We should also pay attention to whether or not there are any significant outliers, and consider dropping rows that contain these outliers, if it will help improve the performance of our model.


Examining Existing Relationships: We might use statistical techniques to examine the relationships between individual variables. This might help us select or exclude certain features as appropriate. If one column has a high correlation with the target column, perhaps we should select it as a feature. However, if the target column was directly derived from other columns, those columns should not be used as features. Also, if multiple feature columns are highly correlated with each other (collinearity), we could consider dropping the redundant ones.
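
One simple way to examine pairwise relationships is a correlation matrix. The sketch below assumes a small made-up dataset with hypothetical column names, where `grade` is the target:

```python
import pandas as pd

# hypothetical example data (illustrative only)
df = pd.DataFrame({
    "study_hours": [1, 2, 3, 4, 5],
    "sleep_hours": [8, 7, 7, 6, 5],
    "grade": [55, 62, 71, 80, 92],
})

# pairwise Pearson correlation between all numeric columns
corr = df.corr()

# correlation of each column with the target, strongest first
print(corr["grade"].sort_values(ascending=False))
```

The same matrix also shows how strongly the features correlate with each other, which can help spot collinearity.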


Scaling Numeric Variables: Pay attention to the range of values for numeric variables. Some models may be more sensitive to the distance between the values, in which case we might choose to scale them into a new domain, for example between 0 and 1.
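
For instance, min-max scaling maps each value into the range 0 to 1 using `(value - min) / (max - min)`. A minimal sketch, assuming a hypothetical `income` column with made-up values:

```python
import pandas as pd

# hypothetical example data (illustrative only)
df = pd.DataFrame({"income": [20_000, 50_000, 80_000, 120_000]})

# min-max scaling: smallest value becomes 0, largest becomes 1
col = df["income"]
df["income_scaled"] = (col - col.min()) / (col.max() - col.min())

print(df)
```

The `MinMaxScaler` class from `sklearn` offers an equivalent transformation.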

Encoding Categorical Variables: If we have categorical features, we may need to convert the category values to numeric space. For example, we might use "one-hot encoding" to create a matrix of 0/1 binary values for each word in a sentence, to represent the contents of the sentence in a way the model can understand.

Engineering New Features: Based on the problem definition and characteristics of the available features, it may sometimes be advantageous to create new features.

## Splitting

Generally we aim to split the original raw dataset into two different subsets: "train" and "test". We train the model on the training data ONLY. We use most of the data (~80% of rows) for training, and the remaining (~20%) for testing.

Sometimes models can be too well fit to the training data and don't generalize well enough on unseen data. This is why we reserve the test dataset for evaluating the model's performance on data it has not yet seen. A more advanced version of this technique, called "Cross Validation", splits the data into several folds and rotates which fold is held out for validation, giving a more reliable estimate of how well the model generalizes.

We'll want to split our datasets using random sampling, to prevent training issues that may arise from similarities and relationships in the underlying data. Sometimes we will use a specific kind of sampling called stratification, which retains the same proportion of target class values. Stratification may be applicable for classification tasks.
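
As a sketch of a stratified split, the `train_test_split` function from `sklearn` accepts a `stratify` parameter. The dataset below is made up for illustration, with a hypothetical `label` column where 20% of rows belong to the positive class:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical imbalanced dataset: 80% class 0, 20% class 1 (illustrative only)
df = pd.DataFrame({"feature": range(100), "label": [0] * 80 + [1] * 20})
x = df[["feature"]]
y = df["label"]

# stratify=y keeps the same class proportions in both subsets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=99, stratify=y
)

print("TRAIN:", x_train.shape, y_train.mean())
print("TEST:", x_test.shape, y_test.mean())
```

With stratification, both subsets retain the 20% share of positive labels found in the full dataset.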
10 changes: 9 additions & 1 deletion docs/notes/predictive-modeling/regression/ols.qmd
@@ -129,6 +129,14 @@ print(results.summary())
The training results contain an r-squared score, however this score reflects how well the model fits the training data. To see how the model generalizes to the test data, we will calculate the r-squared score and other metrics on the test results later.
:::

:::{.callout-note title="Interpreting P-values"}
In the OLS training results summary, the column labeled P>|t| represents the p-value for the corresponding t-statistic of each coefficient.

The t-statistic is used to test whether a particular coefficient is significantly different from zero. The p-value (P>|t|) tells you the probability of observing a t-statistic at least as extreme as the one calculated, assuming the null hypothesis is true (where the null hypothesis typically posits that the coefficient is zero, meaning the feature has no significant effect on the dependent variable).

+ A low p-value (typically less than 0.05) suggests that you can reject the null hypothesis, meaning the coefficient is statistically significant and likely has an impact on the dependent variable.
+ A high p-value (greater than 0.05) indicates that the coefficient is not statistically significant, implying that the feature may not contribute meaningfully to the model.
:::

The part of the training results we care about is the learned weights (i.e. coefficients), which we use to arrive at the line of best fit:

@@ -259,7 +267,7 @@ fig.add_scatter(x=chart_df["StudyHours"], y=chart_df['Prediction'],
fig.add_scatter(x=chart_df["StudyHours"], y=chart_df['Prediction'],
mode='lines+markers',
name='Prediction (with CI)',
marker=dict(color='mediumturquoise', #size=10, symbol="x"
marker=dict(color='red', size=10, #symbol="x"
),
error_y=dict(type='data', symmetric=False,
array=chart_df['CI Upper'],
4 changes: 2 additions & 2 deletions docs/notes/predictive-modeling/regression/seasonality.qmd
@@ -317,7 +317,7 @@ x_monthly
```


Predict the residual (i.e. degree to which we will be over or under trend), based on which month it is?
Can we predict the residual (i.e. degree to which employment will be over or under trend), based on which month it is?

```{python}
y_monthly = df["residual"]
@@ -331,7 +331,7 @@ print(type(results_monthly))
print(results_monthly.summary())
```

Observe the coefficients tell us how each month contributes towards the regression residuals, in other words, for each month, to what degree does the model predict we will be above or below trend?
The coefficients tell us how each month contributes to the regression residuals. In other words, for each month, they show the degree to which the model predicts we will be above or below trend.

**Monthly Predictions of Residuals**

@@ -148,6 +148,7 @@ In this case, we interpret the line of best fit to observe how much the populati
Remember in this dataset the population is expressed in thousands.
:::


## Model Prediction and Evaluation

We use the trained model to make predictions on the test set, and then calculate regression metrics to see how well the model is doing: