
Commit b59e445: delete kernel

Aurelien Massiot committed Sep 20, 2023
1 parent 9d8801f commit b59e445

Showing 2 changed files with 124 additions and 449 deletions.
154 changes: 124 additions & 30 deletions notebook/titanic.ipynb
@@ -3,7 +3,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "b08819f5-01fc-640e-c570-d7c370f34014"
"_cell_guid": "b08819f5-01fc-640e-c570-d7c370f34014",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"This document is a thorough overview of my process for building a predictive model for Kaggle's Titanic competition. I will provide all my essential steps in this model as well as the reasoning behind each decision I made. This model achieves a score of 82.78%, which is in the top 3% of all submissions at the time of this writing. This is a great introductory modeling exercise due to the simple nature of the data, yet there is still a lot to be gleaned from following a process that ultimately yields a high score.\n",
@@ -15,7 +18,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "381ac8a1-9f18-f1bd-67ed-f165373c8d0f"
"_cell_guid": "381ac8a1-9f18-f1bd-67ed-f165373c8d0f",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### The Problem"
@@ -24,7 +30,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "76f20215-7870-47db-e2f0-c253a43aa8db"
"_cell_guid": "76f20215-7870-47db-e2f0-c253a43aa8db",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"We are given information about a subset of the Titanic population and asked to build a predictive model that tells us whether or not a given passenger survived the shipwreck. We are given 10 basic explanatory variables, including passenger gender, age, and price of fare, among others. More details about the competition can be found on the Kaggle site, [here](https://www.kaggle.com/c/titanic). This is a classic binary classification problem, and we will be implementing a random forest classifer."
@@ -33,7 +42,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "0941247d-c2f1-a753-e4ff-583a88f2e7dc"
"_cell_guid": "0941247d-c2f1-a753-e4ff-583a88f2e7dc",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Exploratory Data Analysis"
@@ -42,7 +54,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "e99bef63-bd89-42e9-6897-aba6337b2afb"
"_cell_guid": "e99bef63-bd89-42e9-6897-aba6337b2afb",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The goal of this section is to gain an understanding of our data in order to inform what we do in the feature engineering section. \n",
@@ -54,7 +69,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "9ae4a31b-44ce-72b7-375b-1376bcc81142"
"_cell_guid": "9ae4a31b-44ce-72b7-375b-1376bcc81142",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -66,7 +84,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "6e6e109b-469e-4210-18a9-59d56448fddc"
"_cell_guid": "6e6e109b-469e-4210-18a9-59d56448fddc",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"We then load the data, which we have downloaded from the Kaggle website ([here](https://www.kaggle.com/c/titanic/data) is a link to the data if you need it)."
@@ -76,7 +97,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "e117a178-539a-d880-ab8c-5306d6d671f0"
"_cell_guid": "e117a178-539a-d880-ab8c-5306d6d671f0",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -87,7 +111,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "d67e8e9f-d809-7138-9ebc-c8c9fa1fb88c"
"_cell_guid": "d67e8e9f-d809-7138-9ebc-c8c9fa1fb88c",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"First, let's take a look at the summary of all the data. Immediately, we note that `Age`, `Cabin`, and `Embarked` have nulls that we'll have to deal with. "
@@ -97,7 +124,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "166004fb-0092-7fb9-f890-1b764a7f6da9"
"_cell_guid": "166004fb-0092-7fb9-f890-1b764a7f6da9",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -107,7 +137,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "928cc3b0-b08d-cd49-0c0a-993f51cc5070"
"_cell_guid": "928cc3b0-b08d-cd49-0c0a-993f51cc5070",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"It appears that we can drop the `PassengerId` column, since it is merely an index. Note, however, that some people have reportedly improved their score with the `PassengerId` column. However, my cursory attempt to do so did not yield positive results, and moreover I would like to mimic a real-life scenario, where an index of a dataset generally has no correlation with the target variable."
@@ -117,7 +150,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "2fee872c-6233-57b1-d6a8-d017ef15edbd"
"_cell_guid": "2fee872c-6233-57b1-d6a8-d017ef15edbd",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -127,7 +163,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "792b218a-b4c0-7a4a-2442-f375deee3581"
"_cell_guid": "792b218a-b4c0-7a4a-2442-f375deee3581",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Feature Engineering"
@@ -136,7 +175,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "a69433b7-96b7-589b-4481-180206c1e5b2"
"_cell_guid": "a69433b7-96b7-589b-4481-180206c1e5b2",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Having done our cursory exploration of the variables, we now have a pretty good idea of how we want to transform our variables in preparation for our final dataset. We will perform our feature engineering through a series of helper functions that each serve a specific purpose. "
@@ -159,7 +201,11 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from src.feature_engineering import *"
@@ -168,7 +214,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "8d1efbbb-e0dd-f3d4-acd3-ef9eb5c396e2"
"_cell_guid": "8d1efbbb-e0dd-f3d4-acd3-ef9eb5c396e2",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Having built our helper functions, we can now execute them in order to build our dataset that will be used in the model:a"
@@ -178,7 +227,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "d004f91d-1c6b-e281-6e4c-45b44eadbcca"
"_cell_guid": "d004f91d-1c6b-e281-6e4c-45b44eadbcca",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -192,7 +244,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "35d843bc-3607-f11a-55f9-2828cf5eb91e"
"_cell_guid": "35d843bc-3607-f11a-55f9-2828cf5eb91e",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"We can see that our final dataset has 55 columns, composed of our target column and 54 predictor variables. Although highly dimensional datasets can result in high variance, I think we should be fine here. "
@@ -202,7 +257,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "09391302-b621-4730-7589-7eb017286e7f"
"_cell_guid": "09391302-b621-4730-7589-7eb017286e7f",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -212,7 +270,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "1066e65e-e578-e896-5c38-1457a947ec6f"
"_cell_guid": "1066e65e-e578-e896-5c38-1457a947ec6f",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Hyperparameter Tuning"
@@ -233,7 +294,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "7f6c54fa-033e-075f-0e86-c9c0b469a03b"
"_cell_guid": "7f6c54fa-033e-075f-0e86-c9c0b469a03b",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"from sklearn.model_selection import GridSearchCV \n",
@@ -294,7 +358,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "e494ad2b-92e3-782f-13c1-f53a86602298"
"_cell_guid": "e494ad2b-92e3-782f-13c1-f53a86602298",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"We are now ready to fit our model using the optimal hyperparameters. The out-of-bag score can give us an unbiased estimate of the model accuracy, and we can see that the score is 83.73%, which is only a little higher than our final leaderboard score."
@@ -304,7 +371,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "5593980a-4145-9594-299c-f4d1a9f01970"
"_cell_guid": "5593980a-4145-9594-299c-f4d1a9f01970",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -325,7 +395,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "4b44766d-6974-b7f3-b801-eb5d9423ae49"
"_cell_guid": "4b44766d-6974-b7f3-b801-eb5d9423ae49",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Let's take a brief look at our variable importance according to our random forest model. We can see that some of the original columns we predicted would be important in fact were, including gender, fare, and age. But we also see title, name length, and ticket length feature prominently, so we can pat ourselves on the back for creating such useful variables."
@@ -335,7 +408,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "d77e221b-352d-8669-05d9-f7defce05709"
"_cell_guid": "d77e221b-352d-8669-05d9-f7defce05709",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -347,7 +423,10 @@
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "f4fbf72d-a7b6-1d14-73cb-7f763d291272"
"_cell_guid": "f4fbf72d-a7b6-1d14-73cb-7f763d291272",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Our last step is to predict the target variable for our test data and generate an output file that will be submitted to Kaggle. "
@@ -357,7 +436,10 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "14dc0e66-9fc4-86bf-8927-46d366d4bbcf"
"_cell_guid": "14dc0e66-9fc4-86bf-8927-46d366d4bbcf",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
@@ -370,12 +452,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"This exercise is a good example of how far basic feature engineering can take you. It is worth mentioning that I did try various other models before arriving at this one. Some of the other variations I tried were different groupings for the categorical variables (plenty more combinations remain), linear discriminant analysis on a couple numeric columns, and eliminating more variables, among other things. This is a competition with a generous allotment of submission attempts, and as a result, it's quite possible that even the leaderboard score is an overestimation of the true quality of the model, since the leaderboard can act as more of a validation score instead of a true test score. \n",
"\n",
@@ -385,7 +475,11 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": []
}
