Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.
Solution: homework.ipynb
In this homework, we will use the Bank Marketing dataset. Download it from here.
You can do it with `wget`:

```bash
wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip
unzip bank+marketing.zip
unzip bank.zip
```

We need `bank-full.csv`.
In this dataset the target variable is `y`: whether the client has subscribed to a term deposit or not.
For the rest of the homework, you'll need to use only these columns:

```
'age', 'job', 'marital', 'education', 'balance', 'housing',
'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
'previous', 'poutcome', 'y'
```
Split the data into 3 parts: train/validation/test with 60%/20%/20% distribution. Use the `train_test_split` function for that with `random_state=1`.
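Here is a minimal sketch of the loading and splitting steps, assuming `bank-full.csv` sits in the working directory (the file is semicolon-separated):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# bank-full.csv uses ';' as the field separator
df = pd.read_csv('bank-full.csv', sep=';')

columns = ['age', 'job', 'marital', 'education', 'balance', 'housing',
           'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
           'previous', 'poutcome', 'y']
df = df[columns]

# encode the target as 0/1
df['y'] = (df['y'] == 'yes').astype(int)

# 60%/20%/20%: split off 20% for test, then 25% of the remaining 80% for validation
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
```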
ROC AUC could also be used to evaluate feature importance of numerical variables.
Let's do that:

- For each numerical variable, use it as a score (aka prediction) and compute the AUC with the `y` variable as ground truth.
- Use the training dataset for that.
If your AUC is < 0.5, invert this variable by putting "-" in front (e.g. `-df_train['balance']`).
AUC can go below 0.5 if the variable is negatively correlated with the target variable. You can change the direction of the correlation by negating this variable - then negative correlation becomes positive.
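A sketch of that computation, reusing `df_train` from the split above:

```python
from sklearn.metrics import roc_auc_score

numerical = ['balance', 'day', 'duration', 'previous']

for col in numerical:
    auc = roc_auc_score(df_train['y'], df_train[col])
    if auc < 0.5:
        # negatively correlated with the target, so flip the sign
        auc = roc_auc_score(df_train['y'], -df_train[col])
    print(col, round(auc, 3))
```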
Which numerical variable (among the following 4) has the highest AUC?

- balance
- day
- duration
- previous
Apply one-hot-encoding using `DictVectorizer` and train the logistic regression with these parameters:

```python
LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
```
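One way this step could look, reusing the dataframes from the split above (a sketch, not the reference solution):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# one-hot-encode categorical features; numerical columns pass through unchanged
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train.drop(columns=['y']).to_dict(orient='records'))
X_val = dv.transform(df_val.drop(columns=['y']).to_dict(orient='records'))

y_train = df_train['y'].values
y_val = df_val['y'].values

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
model.fit(X_train, y_train)

# probability of the positive class
y_pred = model.predict_proba(X_val)[:, 1]
print(round(roc_auc_score(y_val, y_pred), 3))
```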
What's the AUC of this model on the validation dataset? (round to 3 digits)
- 0.69
- 0.79
- 0.89
- 0.99
Now let's compute precision and recall for our model.
- Evaluate the model on all thresholds from 0.0 to 1.0 with step 0.01
- For each threshold, compute precision and recall
- Plot them (see the sketch below)
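A sketch of the threshold sweep, reusing `y_val` and `y_pred` from the model above:

```python
import numpy as np
import matplotlib.pyplot as plt

thresholds = np.arange(0.0, 1.01, 0.01)
precisions, recalls = [], []

for t in thresholds:
    predicted = (y_pred >= t)
    tp = ((predicted == 1) & (y_val == 1)).sum()
    fp = ((predicted == 1) & (y_val == 0)).sum()
    fn = ((predicted == 0) & (y_val == 1)).sum()
    # guard against empty denominators at extreme thresholds
    precisions.append(tp / (tp + fp) if tp + fp > 0 else 0.0)
    recalls.append(tp / (tp + fn) if tp + fn > 0 else 0.0)

plt.plot(thresholds, precisions, label='precision')
plt.plot(thresholds, recalls, label='recall')
plt.xlabel('threshold')
plt.legend()
plt.show()
```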
At which threshold do the precision and recall curves intersect?
- 0.265
- 0.465
- 0.665
- 0.865
Precision and recall are conflicting: when one grows, the other goes down. That's why they are often combined into the F1 score, a metric that takes both into account.
This is the formula for computing F1:

$$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$$

where $P$ is precision and $R$ is recall.
Let's compute F1 for all thresholds from 0.0 to 1.0 with increment 0.01
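This can reuse the `precisions`, `recalls`, and `thresholds` lists from the previous sketch:

```python
# F1 is taken as 0 when both precision and recall are 0
f1_scores = [2 * p * r / (p + r) if p + r > 0 else 0.0
             for p, r in zip(precisions, recalls)]

best = int(np.argmax(f1_scores))
print(thresholds[best], round(f1_scores[best], 3))
```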
At which threshold is F1 maximal?
- 0.02
- 0.22
- 0.42
- 0.62
Use the `KFold` class from Scikit-Learn to evaluate our model on 5 different folds:

```python
KFold(n_splits=5, shuffle=True, random_state=1)
```
- Iterate over different folds of `df_full_train`
- Split the data into train and validation
- Train the model on train with these parameters: `LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)`
- Use AUC to evaluate the model on validation (see the sketch below)
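A sketch of this loop; the `evaluate` helper is hypothetical (introduced here so it can be reused in the next step) and relies on the imports from the earlier sketches:

```python
from sklearn.model_selection import KFold

def evaluate(df_tr, df_va, C=1.0):
    # fit the vectorizer on the training fold only
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(df_tr.drop(columns=['y']).to_dict(orient='records'))
    X_va = dv.transform(df_va.drop(columns=['y']).to_dict(orient='records'))
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000)
    model.fit(X_tr, df_tr['y'].values)
    y_pred = model.predict_proba(X_va)[:, 1]
    return roc_auc_score(df_va['y'].values, y_pred)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = [evaluate(df_full_train.iloc[tr_idx], df_full_train.iloc[va_idx])
          for tr_idx, va_idx in kfold.split(df_full_train)]

print(round(np.mean(scores), 3), round(np.std(scores), 3))
```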
How large is the standard deviation of the scores across the different folds?
- 0.0001
- 0.006
- 0.06
- 0.26
Now let's use 5-fold cross-validation to find the best parameter `C`:

- Iterate over the following `C` values: `[0.000001, 0.001, 1]`
- Initialize `KFold` with the same parameters as previously
- Use these parameters for the model: `LogisticRegression(solver='liblinear', C=C, max_iter=1000)`
- Compute the mean score as well as the std, rounding both to 3 decimal digits (see the sketch below)
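Reusing the hypothetical `evaluate` helper from the previous sketch:

```python
for C in [0.000001, 0.001, 1]:
    kfold = KFold(n_splits=5, shuffle=True, random_state=1)
    scores = [evaluate(df_full_train.iloc[tr_idx], df_full_train.iloc[va_idx], C=C)
              for tr_idx, va_idx in kfold.split(df_full_train)]
    print(f'C={C}: mean={np.mean(scores):.3f} std={np.std(scores):.3f}')
```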
Which `C` leads to the best mean score?
- 0.000001
- 0.001
- 1
If you have ties, select the score with the lowest std. If you still have ties, select the smallest `C`.
- Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2024/homework/hw04
- If your answer doesn't match options exactly, select the closest one