Linear Regression, Model Management
s2t2 committed Sep 15, 2024
1 parent 15d3a01 commit a907011
Showing 6 changed files with 298 additions and 1 deletion.
3 changes: 3 additions & 0 deletions docs/notes/predictive-modeling/ml-foundations/index.qmd
@@ -1 +1,4 @@
# Machine Learning Foundations


Machine learning is about predicting something: we use feature variables (x) to predict a target variable (y). In supervised learning we have ground-truth labels for the target (and can hold some out as a test set); in unsupervised learning we do not. Within supervised learning, regression predicts a numeric target, while classification predicts a categorical target.
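
To make these distinctions concrete, here is a minimal sketch contrasting regression and classification on a tiny made-up dataset (the values below are hypothetical, for illustration only):

```{python}
from sklearn.linear_model import LinearRegression, LogisticRegression

# hypothetical supervised dataset (we know the ground-truth targets):
x = [[1], [2], [3], [4]]  # feature: hours studied
y_numeric = [60, 70, 80, 90]  # numeric target -> regression
y_labels = ["fail", "fail", "pass", "pass"]  # categorical target -> classification

# regression predicts a number; classification predicts a label:
print(LinearRegression().fit(x, y_numeric).predict([[5]]))
print(LogisticRegression().fit(x, y_labels).predict([[5]]))
```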
Binary file not shown.
82 changes: 82 additions & 0 deletions docs/notes/predictive-modeling/model-management/saving-loading.qmd
@@ -1,6 +1,88 @@
---
#format:
# html:
# code-fold: show
# code-summary: "Show the code"
---

# Saving and Loading Models

Let's consider the [linear regression](../regression/linear.qmd) model we have previously trained to predict grades given study hours:

```{python}
#| code-fold: show
#| code-overflow: scroll
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/grades.csv"
df = read_csv(request_url)
df.dropna(inplace=True)
x = df[["StudyHours"]]
y = df["Grade"]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)
model = LinearRegression()
model.fit(x_train, y_train)
```

Although this particular model completed its training fairly quickly, it is normal for more complicated models to take hours, days, or even weeks or months to train.

Once the training script has completed (or the training notebook's session has restarted), we unfortunately lose access to the trained model, and would need to re-train it before making more predictions.

To save time and avoid re-training the model each time we need to make predictions, we can train it once, and save the trained model with its learned weights. Then anytime we want to use the model again, we can load it from its saved state.




## Saving Trained Models

To save and load models, we can use the `pickle` module, or the [`joblib` package](https://joblib.readthedocs.io/en/stable/) (preferred):

```{python}
import os
import joblib
# creating a directory to store the model:
MODEL_DIRNAME = "grades-linear-regression"
os.makedirs(MODEL_DIRNAME, exist_ok=True)
# creating a filepath for the model in that directory:
MODEL_FILEPATH = os.path.join(MODEL_DIRNAME, "model.joblib")
# saving the model to the given filepath:
joblib.dump(model, MODEL_FILEPATH)
```

:::{.callout-note title="Model Naming Conventions"}
When using the `joblib` library and related tools to save and load models, it is a convention to name the saved model file "model.joblib" specifically. To differentiate between models, we therefore customize the name of the directory where each model file is stored (in this case, "grades-linear-regression").
:::
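
For reference, the standard library `pickle` module can accomplish the same thing, writing and reading the model as raw bytes (a minimal sketch; the "model.pkl" filename is our own choice):

```{python}
import pickle

# saving the model with pickle (writing bytes to file):
PICKLE_FILEPATH = os.path.join(MODEL_DIRNAME, "model.pkl")
with open(PICKLE_FILEPATH, "wb") as f:
    pickle.dump(model, f)

# loading the model back from file (reading bytes):
with open(PICKLE_FILEPATH, "rb") as f:
    pickled_model = pickle.load(f)
```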

## Loading Pre-trained Models

Once we have saved a pre-trained model to a given filepath, we can load it from file:

```{python}
presaved_model = joblib.load(MODEL_FILEPATH)
presaved_model
```

This model is the same as the one we previously trained, so we can use it to make predictions:

```{python}
from pandas import DataFrame
x_new = DataFrame({"StudyHours": [0, 4, 8, 12, 16, 20]})
presaved_model.predict(x_new)
```

Observe that the predicted values from the loaded model are the same as before:

```{python}
model.predict(x_new)
```
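
As a quick programmatic sanity check, we can assert the two sets of predictions match (a sketch using `numpy.allclose`):

```{python}
import numpy as np

# confirming the loaded model reproduces the original predictions:
assert np.allclose(presaved_model.predict(x_new), model.predict(x_new))
```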
206 changes: 206 additions & 0 deletions docs/notes/predictive-modeling/regression/linear.qmd
@@ -1 +1,207 @@
# Linear Regression

## Data Loading

Loading the data:

```{python}
from pandas import read_csv
request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/grades.csv"
df = read_csv(request_url)
df
```

## Data Exploration

Checking for Nulls:

```{python}
df["StudyHours"].isna().sum()
```

Dropping Nulls:

```{python}
df.dropna(inplace=True)
df.tail()
```

Exploring relationship between variables:


```{python}
import plotly.express as px
px.scatter(df, x="StudyHours", y="Grade", height=350,
title="Relationship between Study Hours and Grades",
trendline="ols", trendline_color_override="red",
)
```

Checking for outliers:

```{python}
px.violin(df, x="StudyHours", box=True, points="all", height=350,
title="Distribution of Study Hours",
)
```

```{python}
px.violin(df, x="Grade", box=True, points="all", height=350,
title="Distribution of Grade"
)
```

## Data Splitting

### X/Y Split

If we have a single feature variable, we reference it as a list of one column name, to keep the data in `DataFrame` format (a two-dimensional array) rather than a `Series` (a one-dimensional array).

```{python}
#x = df["StudyHours"] # ValueError: Expected 2D array, got 1D array instead
x = df[["StudyHours"]] # model wants x to be a matrix
print(x.shape)
y = df["Grade"]
print(y.shape)
```
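
Alternatively, a one-dimensional `Series` could be reshaped into the required two-dimensional structure, although this drops the column name, so the double-bracket approach above is preferred (a sketch):

```{python}
# an equivalent reshape-based approach (note: loses the "StudyHours" column name):
x_alt = df["StudyHours"].values.reshape(-1, 1)
print(x_alt.shape)
```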


### Train Test Split

```{python}
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
```
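
By default, `train_test_split` reserves 25% of the rows for the test set. We could make this explicit (or choose a different proportion) via the `test_size` parameter, which produces the same split here:

```{python}
# equivalent to the default 75/25 split, made explicit:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=99
)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
```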

## Model Selection and Training

Selecting a linear regression model (ordinary least squares, or OLS), and training it on the training data to learn the ideal weights:

```{python}
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)
```

After the model is trained, we have access to the ideal weights (i.e. "coefficients"). There is one coefficient for each feature (in this case only one).

```{python}
print("COEFS:", model.coef_) # one for each feature
print("Y INTERCEPT:", model.intercept_)
```

:::{.callout-note title="Note"}
The convention with `sklearn` models is that any attributes ending with an underscore (`_`), like `coef_` and `intercept_`, are only available after the model has been trained.
:::
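
To illustrate the convention, accessing one of these attributes before training raises an error (a small sketch):

```{python}
# an unfitted model has no learned coefficients yet:
unfitted_model = LinearRegression()
try:
    unfitted_model.coef_
except AttributeError as err:
    print("ERROR:", err)
```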

When we have multiple coefficients, it will be helpful to wrap them in a `Series` to see which weights correspond with which features (although in this case there is only one feature):

```{python}
from pandas import Series
coefs = Series(model.coef_, index=model.feature_names_in_)
print(coefs)
```


The coefficients and y-intercept tell us the line of best fit:

```{python}
print("--------------")
print(f"EQUATION FOR LINE OF BEST FIT:")
print(f"y = ({round(model.coef_[0], 3)} * StudyHours) + {round(model.intercept_, 3)}")
```
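
We can sanity-check this equation by plugging in a value by hand and comparing against the model's own prediction (the 10 study hours below is an arbitrary example value):

```{python}
from pandas import DataFrame

# predicting the grade for 10 study hours, manually vs via the model:
study_hours = 10
manual_pred = model.coef_[0] * study_hours + model.intercept_
model_pred = model.predict(DataFrame({"StudyHours": [study_hours]}))[0]
print("MANUAL:", round(manual_pred, 3))
print("MODEL:", round(model_pred, 3))
```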

## Model Predictions and Evaluation

Alright, we trained the model, but how well does it do in making predictions?

We use the trained model to make predictions on the unseen (test) data:

```{python}
y_pred = model.predict(x_test)
print(y_pred)
```

We can then compare each of the predicted values against the actual known values:

```{python}
# get all rows from the original dataset that wound up in the test set:
test_set = df.loc[x_test.index].copy()
# create a column for the predictions:
test_set["PredictedGrade"] = y_pred.round(1)
# calculate error for each datapoint:
test_set["Error"] = (y_pred - y_test).round(1)
test_set.sort_values(by="StudyHours", ascending=False)
```

Plotting the errors on a graph:

```{python}
px.scatter(test_set, x="StudyHours", y=["Grade", "PredictedGrade"],
hover_data="Name", height=350,
title=f"Prediction errors (test set)",
labels={"value":""}
)
```

To measure how well the model did across the entire test set, we can use any number of regression metrics, such as r-squared score, mean squared error, mean absolute error, or root mean squared error.


It is possible for us to roll our own metrics:

```{python}
my_mae = test_set["Error"].abs().mean()
print("MY MAE:", my_mae.round(3))
```

```{python}
my_mse = (test_set["Error"] ** 2).mean()
print("MY MSE:", my_mse.round(1))
```

More commonly, however, we will use metric functions from the `sklearn.metrics` submodule:


```{python}
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
r2 = r2_score(y_test, y_pred)
print("R^2:", round(r2, 3))
mae = mean_absolute_error(y_test, y_pred)
print("MAE:", round(mae, 3))
mse = mean_squared_error(y_test, y_pred)
print("MSE:", round(mse,3))
```

```{python}
rmse = mse ** 0.5
print("RMSE:", round(rmse, 3))
```
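
Recent versions of `sklearn` (1.4 and later, assuming your environment has one) also provide a dedicated function for this metric:

```{python}
# requires scikit-learn >= 1.4:
from sklearn.metrics import root_mean_squared_error

rmse = root_mean_squared_error(y_test, y_pred)
print("RMSE:", round(rmse, 3))
```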

## Inference

Now that the model has been trained and deemed to perform sufficiently well, we can use it to make predictions on unseen data (sometimes called "inference"):

```{python}
from pandas import DataFrame
x_new = DataFrame({"StudyHours": [0, 4, 8, 12, 16, 20]})
model.predict(x_new)
```
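
To make the results easier to read, we might pair each hypothetical input with its corresponding prediction (a small sketch):

```{python}
# pairing each input with its predicted grade:
preds_df = x_new.copy()
preds_df["PredictedGrade"] = model.predict(x_new).round(1)
preds_df
```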

Alright, we have trained a model and used it to make predictions!
2 changes: 1 addition & 1 deletion docs/notes/predictive-modeling/regression/ridge-lasso.qmd
@@ -1,4 +1,4 @@
# Advanced Regression Models
# Weight-adjusting Regression Models

## Ridge

6 changes: 6 additions & 0 deletions docs/requirements.txt
@@ -24,5 +24,11 @@ lxml # bs4 needs this to parse XML
scipy


# predictive modeling:
scikit-learn
joblib
ucimlrepo



#gspread==6.0.2
