Linear Regression, Model Management
s2t2 committed Sep 15, 2024
1 parent 15d3a01 commit a907011
Showing 6 changed files with 298 additions and 1 deletion.
3 changes: 3 additions & 0 deletions docs/notes/predictive-modeling/ml-foundations/index.qmd
@@ -1 +1,4 @@
# Machine Learning Foundations


Machine learning is about predicting something: we use feature variables (x) to predict a target variable (y). In supervised learning we have ground-truth labels for the target (and can hold some out as a test set); in unsupervised learning we do not. Within supervised learning, regression predicts a numeric target, while classification predicts a categorical target.
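
To make these distinctions concrete, here is a minimal sketch contrasting regression and classification on a tiny made-up dataset (the values below are hypothetical, for illustration only):

```{python}
from sklearn.linear_model import LinearRegression, LogisticRegression

# hypothetical supervised dataset (we know the ground-truth targets):
x = [[1], [2], [3], [4]]  # feature: hours studied
y_numeric = [60, 70, 80, 90]  # numeric target -> regression
y_labels = ["fail", "fail", "pass", "pass"]  # categorical target -> classification

# regression predicts a number; classification predicts a label:
print(LinearRegression().fit(x, y_numeric).predict([[5]]))
print(LogisticRegression().fit(x, y_labels).predict([[5]]))
```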
Binary file not shown.
82 changes: 82 additions & 0 deletions docs/notes/predictive-modeling/model-management/saving-loading.qmd
@@ -1,6 +1,88 @@
---
#format:
# html:
# code-fold: show
# code-summary: "Show the code"
---

# Saving and Loading Models

Let's consider the [linear regression](../regression/linear.qmd) model we have previously trained to predict grades given study hours:

```{python}
#| code-fold: show
#| code-overflow: scroll
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/grades.csv"
df = read_csv(request_url)
df.dropna(inplace=True)
x = df[["StudyHours"]]
y = df["Grade"]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)
model = LinearRegression()
model.fit(x_train, y_train)
```

Although this particular model completed its training fairly quickly, it is normal for more complicated models to take hours, days, or even weeks or months to train.

Once the training script has completed (or the training notebook's session has restarted), we unfortunately lose access to the trained model, and would need to re-train it before making more predictions.

To save time and avoid re-training the model each time we need to make predictions, we can train it once, and save the trained model with its learned weights. Then anytime we want to use the model again, we can load it from its saved state.




## Saving Trained Models

To save and load models, we can use the `pickle` module, or the [`joblib` package](https://joblib.readthedocs.io/en/stable/) (preferred):

```{python}
import os
import joblib
# creating a directory to store the model:
MODEL_DIRNAME = "grades-linear-regression"
os.makedirs(MODEL_DIRNAME, exist_ok=True)
# creating a filepath for the model in that directory:
MODEL_FILEPATH = os.path.join(MODEL_DIRNAME, "model.joblib")
# saving the model to the given filepath:
joblib.dump(model, MODEL_FILEPATH)
```

:::{.callout-note title="Model Naming Conventions"}
When using the `joblib` library and related tools to save and load models, it is a convention to name the saved model file "model.joblib" specifically. To differentiate between models, we therefore customize the name of the directory where each model file is stored (in this case, "grades-linear-regression").
:::
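
For reference, the standard library `pickle` module can accomplish the same thing, writing and reading the model as raw bytes (a minimal sketch; the "model.pkl" filename is our own choice):

```{python}
import pickle

# saving the model with pickle (writing bytes to file):
PICKLE_FILEPATH = os.path.join(MODEL_DIRNAME, "model.pkl")
with open(PICKLE_FILEPATH, "wb") as f:
    pickle.dump(model, f)

# loading the model back from file (reading bytes):
with open(PICKLE_FILEPATH, "rb") as f:
    pickled_model = pickle.load(f)
```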

## Loading Pre-trained Models

Once we have saved a pre-trained model to a given filepath, we can load it from file:

```{python}
presaved_model = joblib.load(MODEL_FILEPATH)
presaved_model
```

This model is the same as the one we previously trained, so we can use it to make predictions:

```{python}
from pandas import DataFrame
x_new = DataFrame({"StudyHours": [0, 4, 8, 12, 16, 20]})
presaved_model.predict(x_new)
```

Observe that the predicted values from the loaded model are the same as before:

```{python}
model.predict(x_new)
```
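
As a quick programmatic sanity check, we can assert the two sets of predictions match (a sketch using `numpy.allclose`):

```{python}
import numpy as np

# confirming the loaded model reproduces the original predictions:
assert np.allclose(presaved_model.predict(x_new), model.predict(x_new))
```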
206 changes: 206 additions & 0 deletions docs/notes/predictive-modeling/regression/linear.qmd
@@ -1 +1,207 @@
# Linear Regression

## Data Loading

Loading the data:

```{python}
from pandas import read_csv
request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/grades.csv"
df = read_csv(request_url)
df
```

## Data Exploration

Checking for Nulls:

```{python}
df["StudyHours"].isna().sum()
```

Dropping Nulls:

```{python}
df.dropna(inplace=True)
df.tail()
```

Exploring relationship between variables:


```{python}
import plotly.express as px
px.scatter(df, x="StudyHours", y="Grade", height=350,
title="Relationship between Study Hours and Grades",
trendline="ols", trendline_color_override="red",
)
```

Checking for outliers:

```{python}
px.violin(df, x="StudyHours", box=True, points="all", height=350,
title="Distribution of Study Hours",
)
```

```{python}
px.violin(df, x="Grade", box=True, points="all", height=350,
title="Distribution of Grade"
)
```

## Data Splitting

### X/Y Split

If we have a single feature variable, we reference it as a list of one column name, to keep the data in `DataFrame` format (a two-dimensional array) rather than a `Series` (a one-dimensional array).

```{python}
#x = df["StudyHours"] # ValueError: Expected 2D array, got 1D array instead
x = df[["StudyHours"]] # model wants x to be a matrix
print(x.shape)
y = df["Grade"]
print(y.shape)
```
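
Alternatively, a one-dimensional `Series` could be reshaped into the required two-dimensional structure, although this drops the column name, so the double-bracket approach above is preferred (a sketch):

```{python}
# an equivalent reshape-based approach (note: loses the "StudyHours" column name):
x_alt = df["StudyHours"].values.reshape(-1, 1)
print(x_alt.shape)
```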


### Train Test Split

```{python}
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
```
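
By default, `train_test_split` reserves 25% of the rows for the test set. We could make this explicit (or choose a different proportion) via the `test_size` parameter, which produces the same split here:

```{python}
# equivalent to the default 75/25 split, made explicit:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=99
)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
```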

## Model Selection and Training

Selecting a linear regression model (ordinary least squares, or OLS), and training it on the training data to learn the ideal weights:

```{python}
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)
```

After the model is trained, we have access to the ideal weights (i.e. "coefficients"). There is one coefficient for each feature (in this case only one).

```{python}
print("COEFS:", model.coef_) # one for each feature
print("Y INTERCEPT:", model.intercept_)
```

:::{.callout-note title="Note"}
The convention with `sklearn` models is that any attributes ending with an underscore (`_`), like `coef_` and `intercept_`, are only available after the model has been trained.
:::
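
To illustrate the convention, accessing one of these attributes before training raises an error (a small sketch):

```{python}
# an unfitted model has no learned coefficients yet:
unfitted_model = LinearRegression()
try:
    unfitted_model.coef_
except AttributeError as err:
    print("ERROR:", err)
```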

When we have multiple coefficients, it will be helpful to wrap them in a `Series` to see which weights correspond with which features (although in this case there is only one feature):

```{python}
from pandas import Series
coefs = Series(model.coef_, index=model.feature_names_in_)
print(coefs)
```


The coefficients and y-intercept tell us the line of best fit:

```{python}
print("--------------")
print(f"EQUATION FOR LINE OF BEST FIT:")
print(f"y = ({round(model.coef_[0], 3)} * StudyHours) + {round(model.intercept_, 3)}")
```
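
We can sanity-check this equation by plugging in a value by hand and comparing against the model's own prediction (the 10 study hours below is an arbitrary example value):

```{python}
from pandas import DataFrame

# predicting the grade for 10 study hours, manually vs via the model:
study_hours = 10
manual_pred = model.coef_[0] * study_hours + model.intercept_
model_pred = model.predict(DataFrame({"StudyHours": [study_hours]}))[0]
print("MANUAL:", round(manual_pred, 3))
print("MODEL:", round(model_pred, 3))
```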

## Model Predictions and Evaluation

Alright, we trained the model, but how well does it do in making predictions?

We use the trained model to make predictions on the unseen (test) data:

```{python}
y_pred = model.predict(x_test)
print(y_pred)
```

We can then compare each of the predicted values against the actual known values:

```{python}
# get all rows from the original dataset that wound up in the test set:
test_set = df.loc[x_test.index].copy()
# create a column for the predictions:
test_set["PredictedGrade"] = y_pred.round(1)
# calculate error for each datapoint:
test_set["Error"] = (y_pred - y_test).round(1)
test_set.sort_values(by="StudyHours", ascending=False)
```

Plotting the errors on a graph:

```{python}
px.scatter(test_set, x="StudyHours", y=["Grade", "PredictedGrade"],
hover_data="Name", height=350,
title=f"Prediction errors (test set)",
labels={"value":""}
)
```

To measure how well the model did across the entire test set, we can use any number of regression metrics, such as r-squared score, mean squared error, mean absolute error, or root mean squared error.


It is possible for us to roll our own metrics:

```{python}
my_mae = test_set["Error"].abs().mean()
print("MY MAE:", my_mae.round(3))
```

```{python}
my_mse = (test_set["Error"] ** 2).mean()
print("MY MSE:", my_mse.round(1))
```

More commonly, however, we will use metric functions from the `sklearn.metrics` submodule:


```{python}
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
r2 = r2_score(y_test, y_pred)
print("R^2:", round(r2, 3))
mae = mean_absolute_error(y_test, y_pred)
print("MAE:", round(mae, 3))
mse = mean_squared_error(y_test, y_pred)
print("MSE:", round(mse,3))
```

```{python}
rmse = mse ** 0.5
print("RMSE:", round(rmse, 3))
```
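
Recent versions of `sklearn` (1.4 and later, assuming your environment has one) also provide a dedicated function for this metric:

```{python}
# requires scikit-learn >= 1.4:
from sklearn.metrics import root_mean_squared_error

rmse = root_mean_squared_error(y_test, y_pred)
print("RMSE:", round(rmse, 3))
```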

## Inference

Now that the model has been trained and deemed to perform sufficiently well, we can use it to make predictions on unseen data (sometimes called "inference"):

```{python}
from pandas import DataFrame
x_new = DataFrame({"StudyHours": [0, 4, 8, 12, 16, 20]})
model.predict(x_new)
```
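
To make the results easier to read, we might pair each hypothetical input with its corresponding prediction (a small sketch):

```{python}
# pairing each input with its predicted grade:
preds_df = x_new.copy()
preds_df["PredictedGrade"] = model.predict(x_new).round(1)
preds_df
```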

Alright, we have trained a model and used it to make predictions!
2 changes: 1 addition & 1 deletion docs/notes/predictive-modeling/regression/ridge-lasso.qmd
@@ -1,4 +1,4 @@
# Advanced Regression Models
# Weight-adjusting Regression Models

## Ridge

6 changes: 6 additions & 0 deletions docs/requirements.txt
@@ -24,5 +24,11 @@ lxml # bs4 needs this to parse XML
scipy


# predictive modeling:
scikit-learn
joblib
ucimlrepo



#gspread==6.0.2
