This is a solution notebook to an assignment question given in a Data Mining graduate course. Each code block is accompanied by relevant analysis wherever required.
Dataset link: https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset
Broadly, the following steps have been performed in this solution notebook:
- A custom train-test split function that splits the data into training, validation, and test sets (70:15:15).
- Decision trees trained with both the Gini index and entropy as splitting criteria, with max-depth varied over [4, 8, 10, 15, 20].
- The splitting criterion (gini/entropy) selected is the one that gives the better accuracy on the test set at the chosen depth.
- Ensembling combines multiple weak models to obtain a better-performing model.
- Created 100 different decision stumps (max-depth 3), each trained on a separately drawn random 50% of the training data.
- Finally, predicted the test samples' labels by taking a majority vote over the stumps' outputs.
- Used the sklearn AdaBoost algorithm on the above dataset and reported the test accuracy.
- A decision tree is used as the base estimator, with the number of estimators varied over [4, 8, 10, 15, 20].
- Further compared the Random Forest and AdaBoost results.
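
The 70:15:15 split in the first step can be sketched as below. This is a minimal illustration and not necessarily the notebook's own implementation: the function name `train_val_test_split`, the `seed` parameter, and the choice to return row indices rather than data frames are all assumptions made here.

```python
import random


def train_val_test_split(n_samples, ratios=(0.70, 0.15, 0.15), seed=0):
    """Return shuffled index lists for the train, validation and test sets.

    Hypothetical helper: the name, signature and seed are assumptions.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility

    # round() avoids losing a row to floating-point truncation
    n_train = round(ratios[0] * n_samples)
    n_val = round(ratios[1] * n_samples)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder, so no row is dropped
    return train, val, test


train_idx, val_idx, test_idx = train_val_test_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The returned index lists could then be used to select rows, for example with `DataFrame.iloc` if the data is held in pandas.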
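
The stump ensemble step (a separate random 50% of the training data per stump, then a majority vote on the test samples) can be sketched as follows. The helper names `sample_half` and `majority_vote` are hypothetical, the toy predictions stand in for the output of trained stumps, and sampling without replacement is an assumption, since the text above does not specify it.

```python
import random
from collections import Counter


def sample_half(n_train, rng):
    """Draw a random 50% of the training row indices.

    Assumption: sampling is done without replacement.
    """
    return rng.sample(range(n_train), n_train // 2)


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one voted label per test sample.

    Ties go to the label that appears first among the models' votes.
    """
    voted = []
    for sample_votes in zip(*predictions_per_model):
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted


rng = random.Random(0)
# One independent subset per model; three models on ten rows for illustration
subsets = [sample_half(10, rng) for _ in range(3)]

# Toy stand-in for stump predictions: one row per model, one column per test sample
toy_predictions = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
final_labels = majority_vote(toy_predictions)
print(final_labels)
```

In the actual pipeline, each row of `toy_predictions` would be replaced by the test-set predictions of one tree fitted on its own subset, with 100 such rows in total.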