
Homework

Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.

Solution: homework.ipynb

In this homework, we will use the same Car price dataset as last week. Download it from here.

Or you can do it with wget:

wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv

We'll work with the MSRP variable, and we'll transform it to a classification task.

For the rest of the homework, you'll need to use only these columns:

  • Make,
  • Model,
  • Year,
  • Engine HP,
  • Engine Cylinders,
  • Transmission Type,
  • Vehicle Style,
  • highway MPG,
  • city mpg,
  • MSRP

Data preparation

  • Keep only the columns above
  • Lowercase the column names and replace spaces with underscores
  • Fill the missing values with 0
  • Make the price binary (1 if above the average, 0 otherwise) - this will be our target variable above_average

Split the data into 3 parts: train/validation/test with a 60%/20%/20% distribution. Use the train_test_split function for that with random_state=1
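A minimal sketch of the preparation and split steps, using a tiny hand-made frame (an assumption, standing in for the real data.csv) so it runs stand-alone:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a subset of the columns (synthetic values, not the real data)
df = pd.DataFrame({
    'Make': ['BMW', 'Audi', 'Ford', 'Kia', 'Fiat'],
    'Engine HP': [300.0, None, 150.0, 120.0, 90.0],
    'MSRP': [40000, 35000, 20000, 15000, 10000],
})

# Lowercase the column names and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Fill the missing values with 0
df = df.fillna(0)

# Binary target: 1 if the price is above the average, 0 otherwise
df['above_average'] = (df['msrp'] > df['msrp'].mean()).astype(int)

# 60/20/20 split: first hold out 20% for test,
# then split the remaining 80% as 75/25 (= 60/20 of the whole)
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
```

On the real dataset the same steps apply unchanged; only the DataFrame source differs.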

Question 1: ROC AUC feature importance

ROC AUC could also be used to evaluate feature importance of numerical variables.

Let's do that

  • For each numerical variable, use it as score and compute AUC with the above_average variable
  • Use the training dataset for that

If your AUC is < 0.5, invert this variable by putting "-" in front

(e.g. -df_train['engine_hp'])

AUC can go below 0.5 if the variable is negatively correlated with the target variable. You can change the direction of the correlation by negating this variable - then negative correlation becomes positive.
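The idea above can be sketched with synthetic values (an assumption, not the real car data): one feature positively related to the target and one negatively related.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic target and features: hp grows with the target, mpg shrinks
y = np.array([0, 0, 0, 1, 1, 1])
hp = np.array([90, 110, 130, 200, 250, 300])
mpg = np.array([40, 35, 30, 22, 20, 18])

# Use the raw feature values directly as scores
auc_hp = roc_auc_score(y, hp)    # positively correlated -> AUC > 0.5
auc_mpg = roc_auc_score(y, mpg)  # negatively correlated -> AUC < 0.5

# If AUC < 0.5, negate the variable to flip the direction
if auc_mpg < 0.5:
    auc_mpg = roc_auc_score(y, -mpg)
```

In this perfectly separable toy case both AUCs end up at 1.0 after the flip; on the real data you would loop over the four numerical columns of df_train.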

Which numerical variable (among the following 4) has the highest AUC?

  • engine_hp
  • engine_cylinders
  • highway_mpg
  • city_mpg

Question 2: Training the model

Apply one-hot-encoding using DictVectorizer and train the logistic regression with these parameters:

LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
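A minimal sketch of the vectorize-and-train step, using a tiny hand-made record list (an assumption, standing in for df_train.to_dict(orient='records')):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Toy records (synthetic, not the real car data)
train_dicts = [
    {'make': 'bmw', 'engine_hp': 300.0},
    {'make': 'kia', 'engine_hp': 120.0},
    {'make': 'audi', 'engine_hp': 250.0},
    {'make': 'fiat', 'engine_hp': 90.0},
]
y_train = [1, 0, 1, 0]

# One-hot encode string features, pass numerical ones through
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
model.fit(X_train, y_train)

# On the real data: X_val = dv.transform(df_val.to_dict(orient='records'))
y_pred = model.predict_proba(X_train)[:, 1]
auc = roc_auc_score(y_train, y_pred)
```

Note that the same fitted DictVectorizer must be reused (transform, not fit_transform) for the validation and test sets.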

What's the AUC of this model on the validation dataset? (round to 3 digits)

  • 0.678
  • 0.779
  • 0.878
  • 0.979

Question 3: Precision and Recall

Now let's compute precision and recall for our model.

  • Evaluate the model on all thresholds from 0.0 to 1.0 with step 0.01
  • For each threshold, compute precision and recall
  • Plot them
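The steps above can be sketched as follows, with toy scores (an assumption) in place of the model's actual validation predictions:

```python
import numpy as np

# Toy ground truth and predicted probabilities (synthetic)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.2, 0.6, 0.4, 0.8])

thresholds = np.arange(0.0, 1.01, 0.01)
precisions, recalls = [], []
for t in thresholds:
    pred = (y_score >= t).astype(int)
    tp = ((pred == 1) & (y_true == 1)).sum()
    fp = ((pred == 1) & (y_true == 0)).sum()
    fn = ((pred == 0) & (y_true == 1)).sum()
    # Precision is undefined when nothing is predicted positive
    precisions.append(tp / (tp + fp) if tp + fp > 0 else 1.0)
    recalls.append(tp / (tp + fn) if tp + fn > 0 else 0.0)

# Plot with e.g. plt.plot(thresholds, precisions) and
# plt.plot(thresholds, recalls), then read off where they cross.
```

On the real data, substitute y_val and the validation probabilities from Question 2.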

At which threshold do the precision and recall curves intersect?

  • 0.28
  • 0.48
  • 0.68
  • 0.88

Question 4: F1 score

Precision and recall are conflicting - when one grows, the other goes down. That's why they are often combined into the F1 score - a metric that takes both into account.

This is the formula for computing F1:

$$F_1 = 2 \cdot \cfrac{P \cdot R}{P + R}$$

Where $P$ is precision and $R$ is recall.

Let's compute F1 for all thresholds from 0.0 to 1.0 with increment 0.01
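Applying the formula is a one-liner once you have per-threshold precision and recall arrays; the values below are toy placeholders (an assumption), not results from the real model:

```python
import numpy as np

# Per-threshold precision and recall (synthetic placeholder values)
p = np.array([0.50, 0.60, 0.75, 1.00])
r = np.array([1.00, 0.90, 0.60, 0.20])

# F1 = 2 * P * R / (P + R), computed element-wise per threshold
f1 = 2 * p * r / (p + r)

# Index of the threshold where F1 is maximal
best = np.argmax(f1)
```

On the real data, build p and r from the 101 thresholds of Question 3 and map best back to thresholds[best]. Guard against division by zero if both precision and recall are 0 at some threshold.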

At which threshold is F1 maximal?

  • 0.12
  • 0.32
  • 0.52
  • 0.72

Question 5: 5-Fold CV

Use the KFold class from Scikit-Learn to evaluate our model on 5 different folds:

KFold(n_splits=5, shuffle=True, random_state=1)

  • Iterate over different folds of df_full_train
  • Split the data into train and validation
  • Train the model on train with these parameters: LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
  • Use AUC to evaluate the model on validation
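The loop above can be sketched like this, with a synthetic feature matrix (an assumption) standing in for the vectorized df_full_train:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic data: 100 rows, 3 features, target driven by the first feature
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)

scores = []
for train_idx, val_idx in kfold.split(X):
    # Train on the fold's train part, evaluate on its validation part
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict_proba(X[val_idx])[:, 1]
    scores.append(roc_auc_score(y[val_idx], y_pred))

mean, std = np.mean(scores), np.std(scores)
```

On the real data, refit the DictVectorizer inside each fold on the fold's train part to avoid leaking validation information.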

How large is the standard deviation of the scores across the different folds?

  • 0.003
  • 0.030
  • 0.090
  • 0.140

Question 6: Hyperparameter Tuning

Now let's use 5-Fold cross-validation to find the best value of the parameter C.

  • Iterate over the following C values: [0.01, 0.1, 0.5, 10]
  • Initialize KFold with the same parameters as previously
  • Use these parameters for the model: LogisticRegression(solver='liblinear', C=C, max_iter=1000)
  • Compute the mean score as well as the std (round the mean and std to 3 decimal digits)
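The search can be sketched as an outer loop over C around the cross-validation from Question 5, again on a synthetic matrix (an assumption) in place of the real features:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the vectorized df_full_train
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)

results = {}
for C in [0.01, 0.1, 0.5, 10]:
    scores = []
    for train_idx, val_idx in kfold.split(X):
        model = LogisticRegression(solver='liblinear', C=C, max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict_proba(X[val_idx])[:, 1]
        scores.append(roc_auc_score(y[val_idx], y_pred))
    results[C] = (round(np.mean(scores), 3), round(np.std(scores), 3))

# Best mean; ties broken by lowest std, then by smallest C
best_C = max(results, key=lambda c: (results[c][0], -results[c][1], -c))
```

The tuple key encodes the stated tie-breaking order directly: higher mean first, then lower std, then smaller C.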

Which C leads to the best mean score?

  • 0.01
  • 0.1
  • 0.5
  • 10

If you have ties, select the score with the lowest std. If you still have ties, select the smallest C.

Submit the results

  • Submit your results here: https://forms.gle/E7Fa3WuBw3HkPQYg6
  • If your answer doesn't match options exactly, select the closest one.
  • You can submit your solution multiple times. In this case, only the last submission will be used

Deadline

The deadline for submitting is October 9 (Monday), 23:00 CET. After that, the form will be closed.