
Commit b59e445: delete kernel

Aurelien Massiot committed Sep 20, 2023
1 parent 9d8801f commit b59e445

Showing 2 changed files with 124 additions and 449 deletions.
154 changes: 124 additions & 30 deletions notebook/titanic.ipynb
@@ -3,7 +3,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "b08819f5-01fc-640e-c570-d7c370f34014"
"_cell_guid": "b08819f5-01fc-640e-c570-d7c370f34014",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"This document is a thorough overview of my process for building a predictive model for Kaggle's Titanic competition. I will provide all my essential steps in this model as well as the reasoning behind each decision I made. This model achieves a score of 82.78%, which is in the top 3% of all submissions at the time of this writing. This is a great introductory modeling exercise due to the simple nature of the data, yet there is still a lot to be gleaned from following a process that ultimately yields a high score.\n",
@@ -15,7 +18,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "381ac8a1-9f18-f1bd-67ed-f165373c8d0f"
"_cell_guid": "381ac8a1-9f18-f1bd-67ed-f165373c8d0f",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### The Problem"
@@ -24,7 +30,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "76f20215-7870-47db-e2f0-c253a43aa8db"
"_cell_guid": "76f20215-7870-47db-e2f0-c253a43aa8db",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"We are given information about a subset of the Titanic population and asked to build a predictive model that tells us whether or not a given passenger survived the shipwreck. We are given 10 basic explanatory variables, including passenger gender, age, and price of fare, among others. More details about the competition can be found on the Kaggle site, [here](https://www.kaggle.com/c/titanic). This is a classic binary classification problem, and we will be implementing a random forest classifer."
@@ -33,7 +42,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "0941247d-c2f1-a753-e4ff-583a88f2e7dc"
"_cell_guid": "0941247d-c2f1-a753-e4ff-583a88f2e7dc",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Exploratory Data Analysis"
@@ -42,7 +54,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "e99bef63-bd89-42e9-6897-aba6337b2afb"
"_cell_guid": "e99bef63-bd89-42e9-6897-aba6337b2afb",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The goal of this section is to gain an understanding of our data in order to inform what we do in the feature engineering section. \n",
@@ -54,7 +69,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "9ae4a31b-44ce-72b7-375b-1376bcc81142"
"_cell_guid": "9ae4a31b-44ce-72b7-375b-1376bcc81142",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -66,7 +84,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "6e6e109b-469e-4210-18a9-59d56448fddc"
"_cell_guid": "6e6e109b-469e-4210-18a9-59d56448fddc",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"We then load the data, which we have downloaded from the Kaggle website ([here](https://www.kaggle.com/c/titanic/data) is a link to the data if you need it)."
@@ -76,7 +97,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "e117a178-539a-d880-ab8c-5306d6d671f0"
"_cell_guid": "e117a178-539a-d880-ab8c-5306d6d671f0",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -87,7 +111,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "d67e8e9f-d809-7138-9ebc-c8c9fa1fb88c"
"_cell_guid": "d67e8e9f-d809-7138-9ebc-c8c9fa1fb88c",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"First, let's take a look at the summary of all the data. Immediately, we note that `Age`, `Cabin`, and `Embarked` have nulls that we'll have to deal with. "
@@ -97,7 +124,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "166004fb-0092-7fb9-f890-1b764a7f6da9"
"_cell_guid": "166004fb-0092-7fb9-f890-1b764a7f6da9",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -107,7 +137,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "928cc3b0-b08d-cd49-0c0a-993f51cc5070"
"_cell_guid": "928cc3b0-b08d-cd49-0c0a-993f51cc5070",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"It appears that we can drop the `PassengerId` column, since it is merely an index. Note, however, that some people have reportedly improved their score with the `PassengerId` column. However, my cursory attempt to do so did not yield positive results, and moreover I would like to mimic a real-life scenario, where an index of a dataset generally has no correlation with the target variable."
@@ -117,7 +150,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "2fee872c-6233-57b1-d6a8-d017ef15edbd"
"_cell_guid": "2fee872c-6233-57b1-d6a8-d017ef15edbd",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -127,7 +163,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "792b218a-b4c0-7a4a-2442-f375deee3581"
"_cell_guid": "792b218a-b4c0-7a4a-2442-f375deee3581",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Feature Engineering"
@@ -136,7 +175,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "a69433b7-96b7-589b-4481-180206c1e5b2"
"_cell_guid": "a69433b7-96b7-589b-4481-180206c1e5b2",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Having done our cursory exploration of the variables, we now have a pretty good idea of how we want to transform our variables in preparation for our final dataset. We will perform our feature engineering through a series of helper functions that each serve a specific purpose. "
@@ -159,7 +201,11 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from src.feature_engineering import *"
@@ -168,7 +214,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "8d1efbbb-e0dd-f3d4-acd3-ef9eb5c396e2"
"_cell_guid": "8d1efbbb-e0dd-f3d4-acd3-ef9eb5c396e2",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Having built our helper functions, we can now execute them in order to build our dataset that will be used in the model:a"
@@ -178,7 +227,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "d004f91d-1c6b-e281-6e4c-45b44eadbcca"
"_cell_guid": "d004f91d-1c6b-e281-6e4c-45b44eadbcca",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -192,7 +244,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "35d843bc-3607-f11a-55f9-2828cf5eb91e"
"_cell_guid": "35d843bc-3607-f11a-55f9-2828cf5eb91e",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"We can see that our final dataset has 55 columns, composed of our target column and 54 predictor variables. Although highly dimensional datasets can result in high variance, I think we should be fine here. "
@@ -202,7 +257,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "09391302-b621-4730-7589-7eb017286e7f"
"_cell_guid": "09391302-b621-4730-7589-7eb017286e7f",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -212,7 +270,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "1066e65e-e578-e896-5c38-1457a947ec6f"
"_cell_guid": "1066e65e-e578-e896-5c38-1457a947ec6f",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Hyperparameter Tuning"
@@ -233,7 +294,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "7f6c54fa-033e-075f-0e86-c9c0b469a03b"
"_cell_guid": "7f6c54fa-033e-075f-0e86-c9c0b469a03b",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"from sklearn.model_selection import GridSearchCV \n",
@@ -294,7 +358,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "e494ad2b-92e3-782f-13c1-f53a86602298"
"_cell_guid": "e494ad2b-92e3-782f-13c1-f53a86602298",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"We are now ready to fit our model using the optimal hyperparameters. The out-of-bag score can give us an unbiased estimate of the model accuracy, and we can see that the score is 83.73%, which is only a little higher than our final leaderboard score."
@@ -304,7 +371,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "5593980a-4145-9594-299c-f4d1a9f01970"
"_cell_guid": "5593980a-4145-9594-299c-f4d1a9f01970",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -325,7 +395,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "4b44766d-6974-b7f3-b801-eb5d9423ae49"
"_cell_guid": "4b44766d-6974-b7f3-b801-eb5d9423ae49",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Let's take a brief look at our variable importance according to our random forest model. We can see that some of the original columns we predicted would be important in fact were, including gender, fare, and age. But we also see title, name length, and ticket length feature prominently, so we can pat ourselves on the back for creating such useful variables."
@@ -335,7 +408,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "d77e221b-352d-8669-05d9-f7defce05709"
"_cell_guid": "d77e221b-352d-8669-05d9-f7defce05709",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -347,7 +423,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "f4fbf72d-a7b6-1d14-73cb-7f763d291272"
"_cell_guid": "f4fbf72d-a7b6-1d14-73cb-7f763d291272",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Our last step is to predict the target variable for our test data and generate an output file that will be submitted to Kaggle. "
@@ -357,7 +436,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "14dc0e66-9fc4-86bf-8927-46d366d4bbcf"
"_cell_guid": "14dc0e66-9fc4-86bf-8927-46d366d4bbcf",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -370,12 +452,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"This exercise is a good example of how far basic feature engineering can take you. It is worth mentioning that I did try various other models before arriving at this one. Some of the other variations I tried were different groupings for the categorical variables (plenty more combinations remain), linear discriminant analysis on a couple numeric columns, and eliminating more variables, among other things. This is a competition with a generous allotment of submission attempts, and as a result, it's quite possible that even the leaderboard score is an overestimation of the true quality of the model, since the leaderboard can act as more of a validation score instead of a true test score. \n",
"\n",
@@ -385,7 +475,11 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": []
}
