Skip to content

Latest commit

 

History

History
165 lines (104 loc) · 4.13 KB

File metadata and controls

165 lines (104 loc) · 4.13 KB

Homework

Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.

Solution: homework.ipynb

In this homework, we will use the Bank Marketing dataset. Download it from here.

You can do it with wget:

wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip
unzip bank+marketing.zip 
unzip bank.zip

We need bank-full.csv.

In this dataset the target variable is y variable - has the client subscribed a term deposit or not.

Dataset preparation

For the rest of the homework, you'll need to use only these columns:

  • 'age',
  • 'job',
  • 'marital',
  • 'education',
  • 'balance',
  • 'housing',
  • 'contact',
  • 'day',
  • 'month',
  • 'duration',
  • 'campaign',
  • 'pdays',
  • 'previous',
  • 'poutcome',
  • 'y'

Split the data into 3 parts: train/validation/test with 60%/20%/20% distribution. Use train_test_split function for that with random_state=1

Question 1: ROC AUC feature importance

ROC AUC could also be used to evaluate feature importance of numerical variables.

Let's do that

  • For each numerical variable, use it as score (aka prediction) and compute the AUC with the y variable as ground truth.
  • Use the training dataset for that

If your AUC is < 0.5, invert this variable by putting "-" in front

(e.g. -df_train['engine_hp'])

AUC can go below 0.5 if the variable is negatively correlated with the target variable. You can change the direction of the correlation by negating this variable - then negative correlation becomes positive.

Which numerical variable (among the following 4) has the highest AUC?

  • balance
  • day
  • duration
  • previous

Question 2: Training the model

Apply one-hot-encoding using DictVectorizer and train the logistic regression with these parameters:

LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)

What's the AUC of this model on the validation dataset? (round to 3 digits)

  • 0.69
  • 0.79
  • 0.89
  • 0.99

Question 3: Precision and Recall

Now let's compute precision and recall for our model.

  • Evaluate the model on all thresholds from 0.0 to 1.0 with step 0.01
  • For each threshold, compute precision and recall
  • Plot them

At which threshold precision and recall curves intersect?

  • 0.265
  • 0.465
  • 0.665
  • 0.865

Question 4: F1 score

Precision and recall are conflicting - when one grows, the other goes down. That's why they are often combined into the F1 score - a metrics that takes into account both

This is the formula for computing F1:

$$F_1 = 2 \cdot \cfrac{P \cdot R}{P + R}$$

Where $P$ is precision and $R$ is recall.

Let's compute F1 for all thresholds from 0.0 to 1.0 with increment 0.01

At which threshold F1 is maximal?

  • 0.02
  • 0.22
  • 0.42
  • 0.62

Question 5: 5-Fold CV

Use the KFold class from Scikit-Learn to evaluate our model on 5 different folds:

KFold(n_splits=5, shuffle=True, random_state=1)
  • Iterate over different folds of df_full_train
  • Split the data into train and validation
  • Train the model on train with these parameters: LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
  • Use AUC to evaluate the model on validation

How large is standard deviation of the scores across different folds?

  • 0.0001
  • 0.006
  • 0.06
  • 0.26

Question 6: Hyperparameter Tuning

Now let's use 5-Fold cross-validation to find the best parameter C

  • Iterate over the following C values: [0.000001, 0.001, 1]
  • Initialize KFold with the same parameters as previously
  • Use these parameters for the model: LogisticRegression(solver='liblinear', C=C, max_iter=1000)
  • Compute the mean score as well as the std (round the mean and std to 3 decimal digits)

Which C leads to the best mean score?

  • 0.000001
  • 0.001
  • 1

If you have ties, select the score with the lowest std. If you still have ties, select the smallest C.

Submit the results