Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.
Solution: homework.ipynb
In this homework, we will use the Car price dataset from last week. You can download it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```
We'll work with the `MSRP` variable, and we'll transform it into a classification task.
For the rest of the homework, you'll need to use only these columns:

- `Make`
- `Model`
- `Year`
- `Engine HP`
- `Engine Cylinders`
- `Transmission Type`
- `Vehicle Style`
- `highway MPG`
- `city mpg`
- `MSRP`
- Keep only the columns above
- Lowercase the column names and replace spaces with underscores
- Fill the missing values with 0
- Make the price binary (1 if above the average, 0 otherwise) - this will be our target variable `above_average` (see the sketch below)
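Here's a minimal sketch of these preparation steps (it assumes the file was saved as `data.csv`; adjust the path as needed):

```python
import pandas as pd

df = pd.read_csv('data.csv')  # path is an assumption -- use wherever you saved the file

# Keep only the columns above
columns = ['Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
           'Transmission Type', 'Vehicle Style', 'highway MPG', 'city mpg', 'MSRP']
df = df[columns]

# Lowercase the column names and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Fill the missing values with 0
df = df.fillna(0)

# Binary target: 1 if the price is above the average, 0 otherwise
df['above_average'] = (df['msrp'] > df['msrp'].mean()).astype(int)
```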
Split the data into 3 parts: train/validation/test with 60%/20%/20% distribution. Use the `train_test_split` function for that with `random_state=1`.
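One way to get the 60%/20%/20% split with two calls to `train_test_split` (the second call takes 25% of the remaining 80%, which is 20% of the full dataset):

```python
from sklearn.model_selection import train_test_split

# 80% full train / 20% test, then split full train into 60% train / 20% validation
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
```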
ROC AUC can also be used to evaluate the feature importance of numerical variables. Let's do that:

- For each numerical variable, use it as the score and compute AUC with the `above_average` variable
- Use the training dataset for that
- If your AUC is < 0.5, invert this variable by putting "-" in front (e.g. `-df_train['engine_hp']`)

AUC can go below 0.5 if the variable is negatively correlated with the target variable. You can change the direction of the correlation by negating the variable - then the negative correlation becomes positive.
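A minimal sketch of this check, assuming `df_train` still contains the `above_average` column:

```python
from sklearn.metrics import roc_auc_score

numerical = ['engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']

for col in numerical:
    auc = roc_auc_score(df_train['above_average'], df_train[col])
    if auc < 0.5:
        # Negatively correlated with the target: negate to flip the direction
        auc = roc_auc_score(df_train['above_average'], -df_train[col])
    print(col, round(auc, 3))
```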
Which numerical variable (among the following 4) has the highest AUC?
- `engine_hp`
- `engine_cylinders`
- `highway_mpg`
- `city_mpg`
Apply one-hot-encoding using `DictVectorizer` and train the logistic regression with these parameters:

```python
LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
```
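A sketch of this step; the feature list and variable names are assumptions based on the preparation above, and note that `msrp` and `above_average` must be excluded from the features to avoid leaking the target:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

y_train = df_train['above_average'].values
y_val = df_val['above_average'].values

# Everything except the raw price and the target itself
features = [c for c in df_train.columns if c not in ('msrp', 'above_average')]

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train[features].to_dict(orient='records'))
X_val = dv.transform(df_val[features].to_dict(orient='records'))

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict_proba(X_val)[:, 1]
print(round(roc_auc_score(y_val, y_pred), 3))
```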
What's the AUC of this model on the validation dataset? (round to 3 digits)
- 0.678
- 0.779
- 0.878
- 0.979
Now let's compute precision and recall for our model.
- Evaluate the model on all thresholds from 0.0 to 1.0 with step 0.01
- For each threshold, compute precision and recall
- Plot them (see the sketch below)
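A minimal sketch of this loop, reusing `y_val` and `y_pred` from the previous step:

```python
import numpy as np
import matplotlib.pyplot as plt

thresholds = np.arange(0.0, 1.01, 0.01)
precisions, recalls = [], []

for t in thresholds:
    predicted = (y_pred >= t)
    actual = (y_val == 1)
    tp = (predicted & actual).sum()
    fp = (predicted & ~actual).sum()
    fn = (~predicted & actual).sum()
    # Guard against division by zero at extreme thresholds
    precisions.append(tp / (tp + fp) if (tp + fp) > 0 else 0.0)
    recalls.append(tp / (tp + fn) if (tp + fn) > 0 else 0.0)

plt.plot(thresholds, precisions, label='precision')
plt.plot(thresholds, recalls, label='recall')
plt.xlabel('threshold')
plt.legend()
plt.show()
```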
At which threshold do the precision and recall curves intersect?
- 0.28
- 0.48
- 0.68
- 0.88
Precision and recall are conflicting - when one grows, the other goes down. That's why they are often combined into the F1 score - a metric that takes both into account.
This is the formula for computing F1:

$$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$$

Where $P$ is precision and $R$ is recall.
Let's compute F1 for all thresholds from 0.0 to 1.0 with increment 0.01
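Reusing the `thresholds`, `precisions`, and `recalls` from the sketch above:

```python
import numpy as np

# F1 = 2 * P * R / (P + R); guard against division by zero
f1_scores = [
    2 * p * r / (p + r) if (p + r) > 0 else 0.0
    for p, r in zip(precisions, recalls)
]

best = int(np.argmax(f1_scores))
print(round(thresholds[best], 2), round(f1_scores[best], 3))
```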
At which threshold is F1 maximal?
- 0.12
- 0.32
- 0.52
- 0.72
Use the `KFold` class from Scikit-Learn to evaluate our model on 5 different folds:

```python
KFold(n_splits=5, shuffle=True, random_state=1)
```

- Iterate over different folds of `df_full_train`
- Split the data into train and validation
- Train the model on train with these parameters: `LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)`
- Use AUC to evaluate the model on validation (see the sketch below)
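A sketch of this evaluation, assuming `df_full_train` and the `features` list from the earlier steps; the `train_and_score` helper is hypothetical, introduced just to keep the loop readable:

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

def train_and_score(df_tr, df_va, C=1.0):
    # Vectorize, fit logistic regression, and return the validation AUC
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(df_tr[features].to_dict(orient='records'))
    X_va = dv.transform(df_va[features].to_dict(orient='records'))
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000)
    model.fit(X_tr, df_tr['above_average'].values)
    y_p = model.predict_proba(X_va)[:, 1]
    return roc_auc_score(df_va['above_average'].values, y_p)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = []

for train_idx, val_idx in kfold.split(df_full_train):
    scores.append(train_and_score(df_full_train.iloc[train_idx],
                                  df_full_train.iloc[val_idx]))

print(round(np.mean(scores), 3), round(np.std(scores), 3))
```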
How large is the standard deviation of the scores across different folds?
- 0.003
- 0.030
- 0.090
- 0.140
Now let's use 5-Fold cross-validation to find the best parameter `C`:

- Iterate over the following `C` values: `[0.01, 0.1, 0.5, 10]`
- Initialize `KFold` with the same parameters as previously
- Use these parameters for the model: `LogisticRegression(solver='liblinear', C=C, max_iter=1000)`
- Compute the mean score as well as the std (round the mean and std to 3 decimal digits; see the sketch below)
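Reusing the hypothetical `train_and_score` helper from the previous sketch:

```python
for C in [0.01, 0.1, 0.5, 10]:
    kfold = KFold(n_splits=5, shuffle=True, random_state=1)
    scores = [
        train_and_score(df_full_train.iloc[train_idx],
                        df_full_train.iloc[val_idx], C=C)
        for train_idx, val_idx in kfold.split(df_full_train)
    ]
    print(C, round(np.mean(scores), 3), round(np.std(scores), 3))
```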
Which `C` leads to the best mean score?
- 0.01
- 0.1
- 0.5
- 10
If you have ties, select the score with the lowest std. If you still have ties, select the smallest `C`.
- Submit your results here: https://forms.gle/E7Fa3WuBw3HkPQYg6
- If your answer doesn't match options exactly, select the closest one.
- You can submit your solution multiple times. In this case, only the last submission will be used.
The deadline for submitting is October 9 (Monday), 23:00 CET. After that the form will be closed.