Let's practice what we've learned using the Glass Identification dataset.
- Read the data into a DataFrame.
- Briefly explore the data to make sure the DataFrame matches your expectations.
- Let's convert this into a binary classification problem. Create a new DataFrame column called "binary":
    - If the glass type is 1, 2, 3, or 4, set binary = 0.
    - If the glass type is 5, 6, or 7, set binary = 1.
- Create a feature matrix "X" using all features. (Think carefully about which columns are actually features!)
- Create a response vector "y" from the "binary" column.
- Split X and y into training and testing sets.
- Fit a KNN model on the training set using K=5.
- Make predictions on the testing set and calculate testing accuracy.
- Write a for loop that computes the testing accuracy for a range of K values.
- Plot the K value versus testing accuracy to help you choose an optimal value for K.
- Calculate the testing accuracy that could be achieved by always predicting the most frequent class in the testing set. (This is known as the "null accuracy".)
- Bonus: Explore the data to determine which features look like good predictors, and then redo this exercise using only those features to see if you can achieve a higher testing accuracy!
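
One possible way to organize the steps above is sketched below. It is a starting point under stated assumptions, not a reference solution. It assumes the data has been saved locally as a headerless CSV (the path `glass.data` is hypothetical) with columns in the order given in the UCI description: an ID, nine measurements, and the glass type. The column names in `COLUMNS` and the helper names (`add_binary`, `knn_accuracy_by_k`, `null_accuracy`, `plot_scores`, `run_exercise`) are chosen for illustration only.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed column names; the raw file has no header row.
COLUMNS = ["id", "ri", "na", "mg", "al", "si", "k", "ca", "ba", "fe", "glass_type"]
# The ID and the glass type are not features, so they are left out here.
FEATURES = ["ri", "na", "mg", "al", "si", "k", "ca", "ba", "fe"]


def add_binary(glass):
    """Return a copy with a 'binary' column: 0 for types 1-4, 1 for types 5-7."""
    glass = glass.copy()
    glass["binary"] = glass["glass_type"].isin([5, 6, 7]).astype(int)
    return glass


def knn_accuracy_by_k(X_train, X_test, y_train, y_test, k_values):
    """Fit one KNN model per K and return {K: testing accuracy}."""
    scores = {}
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        scores[k] = accuracy_score(y_test, knn.predict(X_test))
    return scores


def null_accuracy(y_test):
    """Accuracy of always predicting the most frequent class in y_test."""
    return y_test.value_counts(normalize=True).max()


def plot_scores(scores):
    """Plot K against testing accuracy (matplotlib is imported lazily)."""
    import matplotlib.pyplot as plt

    plt.plot(list(scores.keys()), list(scores.values()))
    plt.xlabel("Value of K")
    plt.ylabel("Testing accuracy")
    plt.show()


def run_exercise(path="glass.data", k_values=range(1, 31), random_state=1):
    """Run the whole workflow on a local copy of the data (path is an assumption)."""
    glass = pd.read_csv(path, names=COLUMNS, index_col="id")
    glass = add_binary(glass)
    X = glass[FEATURES]
    y = glass["binary"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state
    )
    scores = knn_accuracy_by_k(X_train, X_test, y_train, y_test, k_values)
    return scores, null_accuracy(y_test)
```

With the file in place, `scores, baseline = run_exercise()` returns the accuracy for each K along with the null accuracy; `scores[5]` is the K=5 result, and `plot_scores(scores)` draws the curve. Comparing the best score against `baseline` shows how much the model actually adds over always guessing the majority class. For the bonus, a shorter `FEATURES` list can be substituted after exploring the data.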