Citations #1

Open
wants to merge 3 commits into
base: main
514 changes: 514 additions & 0 deletions docs/notes/autoregressive-models/arima.quarto_ipynb

Large diffs are not rendered by default.

532 changes: 532 additions & 0 deletions docs/notes/autoregressive-models/autocorrelation.quarto_ipynb

Large diffs are not rendered by default.

256 changes: 256 additions & 0 deletions docs/notes/classification/index.quarto_ipynb
@@ -0,0 +1,256 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Classification\n",
"\n",
"**Classification** is a supervised learning task where the variable we are trying to predict is discrete, whether that is binary or categorical.\n",
"\n",
"Some examples of discrete data include:\n",
"\n",
" + Whether an email is spam or not (binary)\n",
" + Whether an outcome is successful or not (binary)\n",
" + Which of nine numeric digits is represented by some handwriting (categorical)\n",
" + Which of three families a given penguin is likely to be a member of (categorical)\n",
"\n",
"In this chapter, we will explore different classification models, and introduce key performance metrics used to evaluate the effectiveness of classification models.\n",
"\n",
"\n",
"\n",
"## Classification Models\n",
"\n",
"Classification Models in Python:\n",
"\n",
" + [`LogisticRegression`](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html) from `sklearn` (NOTE: this is a classification model, not a regression model)\n",
" + [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) from `sklearn`\n",
" + [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from `sklearn`\n",
" + [`XGBClassifier`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier) from `xgboost`\n",
" + etc.\n",
"\n",
"For text classification specifically, we will often use:\n",
"\n",
" + Naive Bayes Classifier, [`MultinomialNB`](https://scikit-learn.org/1.5/modules/generated/sklearn.naive_bayes.MultinomialNB.html) from `sklearn`\n",
"\n",
"## Classification Metrics\n",
"\n",
"Classification Metrics:\n",
"\n",
" + Accuracy\n",
" + Precision\n",
" + Recall\n",
" + F-1 Score\n",
" + ROC AUC\n",
" + etc.\n",
"\n",
"In addition to these metrics, we can use techniques such as the Confusion Matrix to evaluate classification results.\n",
"\n",
"Additional resources about Classification Metrics from Google ML Crash Course:\n",
"\n",
" + [Precision and Recall](https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall)\n",
" + [AUC and ROC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)\n",
"\n",
"### Accuracy\n",
"\n",
"**Accuracy** measures the proportion of correctly classified instances among the total instances.\n",
"\n",
"\n",
"$$\n",
"\\text{Accuracy} = \\frac{\\text{True Positives} + \\text{True Negatives}}{\\text{Total Instances}}\n",
"$$\n",
"\n",
" + Pros: Provides a quick and simple measure of model performance.\n",
" + Cons: Can be misleading when the classes are imbalanced (e.g. rare event classification), as it does not differentiate between the types of errors (false positives vs. false negatives).\n",
"\n",
"\n",
"\n",
"### Precision, Recall, and F1 Score\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"**Precision** measures the accuracy of positive predictions, reflecting the proportion of true positives among all instances predicted as positive.\n",
"\n",
"\n",
"$$\n",
"\\text{Precision} = \\frac{\\text{True Positives}}{\\text{True Positives} + \\text{False Positives}}\n",
"$$\n",
"\n",
" + Pros: Useful in scenarios where false positives are costly (e.g. spam detection).\n",
" + Cons: Does not account for false negatives, making it less informative in cases where missing positives is a concern.\n",
"\n",
"\n",
"**Recall**, or True Positive Rate, measures the model's ability to identify all positive instances, reflecting the proportion of true positives among actual positive instances.\n",
"\n",
"\n",
"$$\n",
"\\text{Recall} = \\frac{\\text{True Positives}}{\\text{True Positives} + \\text{False Negatives}}\n",
"$$\n",
"\n",
" + Pros: Important in situations where missing positive cases is costly (e.g. disease diagnosis).\n",
" + Cons: Ignores false positives, potentially overemphasizing true positives at the cost of precision.\n",
"\n",
"\n",
"The **F-1 Score** is the harmonic mean of precision and recall, balancing the two metrics into a single score.\n",
"\n",
"$$\n",
"\\text{F-1 Score} = 2 \\cdot \\frac{\\text{Precision} \\times \\text{Recall}}{\\text{Precision} + \\text{Recall}}\n",
"$$\n",
"\n",
"\n",
" + Pros: Offers a balance between precision and recall, particularly useful when there is an uneven class distribution.\n",
" + Cons: Can be less interpretable alone, especially if more weight should be given to precision or recall individually.\n",
"\n",
"\n",
"\n",
"### ROC-AUC\n",
"\n",
"ROC stands for the Receiver Operating Characteristic.\n",
"\n",
"The ROC AUC score is the area under the ROC curve plotted with **True Positive Rate** on the y-axis and **False Positive Rate** on the x-axis, often computed numerically as there's no closed-form formula.\n",
"\n",
"\n",
"\n",
"\n",
"### Confusion Matrix\n",
"\n",
"In addition to using metrics to evaluate classification results, we can use additional techniques such as a confusion matrix to show how many observations were properly classified vs mis-classified.\n",
"\n",
"\n",
"\n",
"\n",
"## Classification Metrics in Python\n",
"\n",
"\n",
"For convenience, we will generally prefer to use classification metric functions from the [`sklearn.metrics` submodule](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics):\n",
"\n",
"\n",
"```python\n",
"from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n",
"\n",
"accy = accuracy_score(y_true, y_pred)\n",
"print(\"ACCY:\", round(accy,3))\n",
"\n",
"prec = precision_score(y_true, y_pred)\n",
"print(\"PRECISION:\", round(prec,3))\n",
"\n",
"rec = recall_score(y_true, y_pred)\n",
"print(\"RECALL:\", round(rec,3))\n",
"\n",
"f1 = f1_score(y_true, y_pred)\n",
"print(\"F1:\", round(f1,3))\n",
"```\n",
"\n",
"We also have access to the `classification_report` function which provides all of these metrics in a single report:\n",
"\n",
"```python\n",
"from sklearn.metrics import classification_report\n",
"\n",
"print(classification_report(y_true, y_pred))\n",
"```\n",
"\n",
"In addition to these metrics, we also have evaluation tools such as the `confusion_matrix` function:\n",
"\n",
"```python\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"confusion_matrix(y_true, y_pred)\n",
"```\n",
"\n",
"When using these functions, we pass in the actual values (`y_true`), as well as the predicted values (`y_pred`), We take these values from the training set to arrive at training metrics, or from the test set to arrive at test metrics.\n",
"\n",
"\n",
"Here is a helper function for visualizing the results of a confusion matrix, using a color-coded heatmap:\n"
],
"id": "130c5876"
},
{
"cell_type": "code",
"metadata": {},
"source": [
"#| code-fold: show\n",
"\n",
"from sklearn.metrics import confusion_matrix\n",
"import plotly.express as px\n",
"\n",
"def plot_confusion_matrix(y_true, y_pred, height=450, showscale=False, title=None, subtitle=None):\n",
" # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html\n",
" # Confusion matrix whose i-th row and j-th column\n",
" # ... indicates the number of samples with\n",
" # ... true label being i-th class (ROW)\n",
" # ... and predicted label being j-th class (COLUMN)\n",
" cm = confusion_matrix(y_true, y_pred)\n",
"\n",
" class_names = sorted(y_test.unique().tolist())\n",
"\n",
" cm = confusion_matrix(y_test, y_pred, labels=class_names)\n",
"\n",
" title = title or \"Confusion Matrix\"\n",
" if subtitle:\n",
" title += f\"<br><sup>{subtitle}</sup>\"\n",
"\n",
" fig = px.imshow(cm, x=class_names, y=class_names, height=height,\n",
" labels={\"x\": \"Predicted\", \"y\": \"Actual\"},\n",
" color_continuous_scale=\"Blues\", text_auto=True,\n",
" )\n",
" fig.update_layout(title={'text': title, 'x':0.485, 'xanchor': 'center'})\n",
" fig.update_coloraxes(showscale=showscale)\n",
"\n",
" fig.show()\n"
],
"id": "42541ac5",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, the ROC-AUC score:\n",
"\n",
"```python\n",
"from sklearn.metrics import roc_auc_score\n",
"\n",
"# get \"logits\" (predicted probabilities for each class)\n",
"y_pred_proba = model.predict_proba(x_test)\n",
"\n",
"# for multi-class, pass all probas and use \"ovr\" (one vs rest)\n",
"roc_auc = roc_auc_score(y_test, y_pred_proba, multi_class=\"ovr\")\n",
"print(\"ROC-AUC:\", roc_auc)\n",
"```\n",
"\n",
"Helper function for ROC-AUC for binary or multi-class classification:\n",
"\n",
"```python\n",
"from sklearn.metrics import roc_auc_score\n",
"\n",
"def compute_roc_auc_score(y_test, y_pred_proba, is_multiclass=True):\n",
" \"\"\"NOTE: roc_auc_score uses average='macro' by default\"\"\"\n",
"\n",
" if is_multiclass:\n",
" # all classes (for multi-class), with \"one-versus-rest\" strategy\n",
" return roc_auc_score(y_true=y_test, y_score=y_pred_proba, multi_class=\"ovr\")\n",
" else:\n",
" # positive class (for binary classification)\n",
" y_pred_proba_pos = y_pred_proba[:,1]\n",
" return roc_auc_score(y_true=y_test, y_score=y_pred_proba_pos)\n",
"\n",
"\n",
"```"
],
"id": "f1e73e19"
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"language": "python",
"display_name": "Python 3 (ipykernel)",
"path": "/opt/anaconda3/share/jupyter/kernels/python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}