Citations #1

Open
wants to merge 3 commits into
base: main
514 changes: 514 additions & 0 deletions docs/notes/autoregressive-models/arima.quarto_ipynb

Large diffs are not rendered by default.

532 changes: 532 additions & 0 deletions docs/notes/autoregressive-models/autocorrelation.quarto_ipynb

Large diffs are not rendered by default.

256 changes: 256 additions & 0 deletions docs/notes/classification/index.quarto_ipynb
@@ -0,0 +1,256 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Classification\n",
"\n",
"**Classification** is a supervised learning task where the variable we are trying to predict is discrete, whether that is binary or categorical.\n",
"\n",
"Some examples of discrete data include:\n",
"\n",
" + Whether an email is spam or not (binary)\n",
" + Whether an outcome is successful or not (binary)\n",
" + Which of nine numeric digits is represented by some handwriting (categorical)\n",
" + Which of three families a given penguin is likely to be a member of (categorical)\n",
"\n",
"In this chapter, we will explore different classification models, and introduce key performance metrics used to evaluate the effectiveness of classification models.\n",
"\n",
"\n",
"\n",
"## Classification Models\n",
"\n",
"Classification Models in Python:\n",
"\n",
" + [`LogisticRegression`](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html) from `sklearn` (NOTE: this is a classification model, not a regression model)\n",
" + [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) from `sklearn`\n",
" + [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from `sklearn`\n",
" + [`XGBClassifier`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier) from `xgboost`\n",
" + etc.\n",
"\n",
"For text classification specifically, we will often use:\n",
"\n",
" + Naive Bayes Classifier, [`MultinomialNB`](https://scikit-learn.org/1.5/modules/generated/sklearn.naive_bayes.MultinomialNB.html) from `sklearn`\n",
"\n",
"## Classification Metrics\n",
"\n",
"Classification Metrics:\n",
"\n",
" + Accuracy\n",
" + Precision\n",
" + Recall\n",
" + F-1 Score\n",
" + ROC AUC\n",
" + etc.\n",
"\n",
"In addition to these metrics, we can use techniques such as the Confusion Matrix to evaluate classification results.\n",
"\n",
"Additional resources about Classification Metrics from Google ML Crash Course:\n",
"\n",
" + [Precision and Recall](https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall)\n",
" + [AUC and ROC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)\n",
"\n",
"### Accuracy\n",
"\n",
"**Accuracy** measures the proportion of correctly classified instances among the total instances.\n",
"\n",
"\n",
"$$\n",
"\\text{Accuracy} = \\frac{\\text{True Positives} + \\text{True Negatives}}{\\text{Total Instances}}\n",
"$$\n",
"\n",
" + Pros: Provides a quick and simple measure of model performance.\n",
" + Cons: Can be misleading when the classes are imbalanced (e.g. rare event classification), as it does not differentiate between the types of errors (false positives vs. false negatives).\n",
"\n",
"\n",
"\n",
"### Precision, Recall, and F1 Score\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"**Precision** measures the accuracy of positive predictions, reflecting the proportion of true positives among all instances predicted as positive.\n",
"\n",
"\n",
"$$\n",
"\\text{Precision} = \\frac{\\text{True Positives}}{\\text{True Positives} + \\text{False Positives}}\n",
"$$\n",
"\n",
" + Pros: Useful in scenarios where false positives are costly (e.g. spam detection).\n",
" + Cons: Does not account for false negatives, making it less informative in cases where missing positives is a concern.\n",
"\n",
"\n",
"**Recall**, or True Positive Rate, measures the model's ability to identify all positive instances, reflecting the proportion of true positives among actual positive instances.\n",
"\n",
"\n",
"$$\n",
"\\text{Recall} = \\frac{\\text{True Positives}}{\\text{True Positives} + \\text{False Negatives}}\n",
"$$\n",
"\n",
" + Pros: Important in situations where missing positive cases is costly (e.g. disease diagnosis).\n",
" + Cons: Ignores false positives, potentially overemphasizing true positives at the cost of precision.\n",
"\n",
"\n",
"The **F-1 Score** is the harmonic mean of precision and recall, balancing the two metrics into a single score.\n",
"\n",
"$$\n",
"\\text{F-1 Score} = 2 \\cdot \\frac{\\text{Precision} \\times \\text{Recall}}{\\text{Precision} + \\text{Recall}}\n",
"$$\n",
"\n",
"\n",
" + Pros: Offers a balance between precision and recall, particularly useful when there is an uneven class distribution.\n",
" + Cons: Can be less interpretable alone, especially if more weight should be given to precision or recall individually.\n",
"\n",
"\n",
"\n",
"### ROC-AUC\n",
"\n",
"ROC stands for the Receiver Operating Characteristic.\n",
"\n",
"The ROC AUC score is the area under the ROC curve plotted with **True Positive Rate** on the y-axis and **False Positive Rate** on the x-axis, often computed numerically as there's no closed-form formula.\n",
"\n",
"\n",
"\n",
"\n",
"### Confusion Matrix\n",
"\n",
"In addition to using metrics to evaluate classification results, we can use additional techniques such as a confusion matrix to show how many observations were properly classified vs mis-classified.\n",
"\n",
"\n",
"\n",
"\n",
"## Classification Metrics in Python\n",
"\n",
"\n",
"For convenience, we will generally prefer to use classification metric functions from the [`sklearn.metrics` submodule](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics):\n",
"\n",
"\n",
"```python\n",
"from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n",
"\n",
"accy = accuracy_score(y_true, y_pred)\n",
"print(\"ACCY:\", round(accy,3))\n",
"\n",
"prec = precision_score(y_true, y_pred)\n",
"print(\"PRECISION:\", round(prec,3))\n",
"\n",
"rec = recall_score(y_true, y_pred)\n",
"print(\"RECALL:\", round(rec,3))\n",
"\n",
"f1 = f1_score(y_true, y_pred)\n",
"print(\"F1:\", round(f1,3))\n",
"```\n",
"\n",
"We also have access to the `classification_report` function which provides all of these metrics in a single report:\n",
"\n",
"```python\n",
"from sklearn.metrics import classification_report\n",
"\n",
"print(classification_report(y_true, y_pred))\n",
"```\n",
"\n",
"In addition to these metrics, we also have evaluation tools such as the `confusion_matrix` function:\n",
"\n",
"```python\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"confusion_matrix(y_true, y_pred)\n",
"```\n",
"\n",
"When using these functions, we pass in the actual values (`y_true`), as well as the predicted values (`y_pred`), We take these values from the training set to arrive at training metrics, or from the test set to arrive at test metrics.\n",
"\n",
"\n",
"Here is a helper function for visualizing the results of a confusion matrix, using a color-coded heatmap:\n"
],
"id": "130c5876"
},
{
"cell_type": "code",
"metadata": {},
"source": [
"#| code-fold: show\n",
"\n",
"from sklearn.metrics import confusion_matrix\n",
"import plotly.express as px\n",
"\n",
"def plot_confusion_matrix(y_true, y_pred, height=450, showscale=False, title=None, subtitle=None):\n",
" # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html\n",
" # Confusion matrix whose i-th row and j-th column\n",
" # ... indicates the number of samples with\n",
" # ... true label being i-th class (ROW)\n",
" # ... and predicted label being j-th class (COLUMN)\n",
" cm = confusion_matrix(y_true, y_pred)\n",
"\n",
" class_names = sorted(y_test.unique().tolist())\n",
"\n",
" cm = confusion_matrix(y_test, y_pred, labels=class_names)\n",
"\n",
" title = title or \"Confusion Matrix\"\n",
" if subtitle:\n",
" title += f\"<br><sup>{subtitle}</sup>\"\n",
"\n",
" fig = px.imshow(cm, x=class_names, y=class_names, height=height,\n",
" labels={\"x\": \"Predicted\", \"y\": \"Actual\"},\n",
" color_continuous_scale=\"Blues\", text_auto=True,\n",
" )\n",
" fig.update_layout(title={'text': title, 'x':0.485, 'xanchor': 'center'})\n",
" fig.update_coloraxes(showscale=showscale)\n",
"\n",
" fig.show()\n"
],
"id": "42541ac5",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, the ROC-AUC score:\n",
"\n",
"```python\n",
"from sklearn.metrics import roc_auc_score\n",
"\n",
"# get \"logits\" (predicted probabilities for each class)\n",
"y_pred_proba = model.predict_proba(x_test)\n",
"\n",
"# for multi-class, pass all probas and use \"ovr\" (one vs rest)\n",
"roc_auc = roc_auc_score(y_test, y_pred_proba, multi_class=\"ovr\")\n",
"print(\"ROC-AUC:\", roc_auc)\n",
"```\n",
"\n",
"Helper function for ROC-AUC for binary or multi-class classification:\n",
"\n",
"```python\n",
"from sklearn.metrics import roc_auc_score\n",
"\n",
"def compute_roc_auc_score(y_test, y_pred_proba, is_multiclass=True):\n",
" \"\"\"NOTE: roc_auc_score uses average='macro' by default\"\"\"\n",
"\n",
" if is_multiclass:\n",
" # all classes (for multi-class), with \"one-versus-rest\" strategy\n",
" return roc_auc_score(y_true=y_test, y_score=y_pred_proba, multi_class=\"ovr\")\n",
" else:\n",
" # positive class (for binary classification)\n",
" y_pred_proba_pos = y_pred_proba[:,1]\n",
" return roc_auc_score(y_true=y_test, y_score=y_pred_proba_pos)\n",
"\n",
"\n",
"```"
],
"id": "f1e73e19"
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"language": "python",
"display_name": "Python 3 (ipykernel)",
"path": "/opt/anaconda3/share/jupyter/kernels/python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}