# # Ensembling
# *Adapted from Chapter 8 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/)*
# # Part 1: Introduction
# Let's pretend that instead of building a single model to solve a classification problem, you created **five independent models**, and each model was correct about 70% of the time. If you combined these models into an "ensemble" and used their majority vote as a prediction, how often would the ensemble be correct?
import numpy as np
# set a seed for reproducibility
np.random.seed(1234)
# generate 1000 random numbers (between 0 and 1) for each model, representing 1000 observations
mod1 = np.random.rand(1000)
mod2 = np.random.rand(1000)
mod3 = np.random.rand(1000)
mod4 = np.random.rand(1000)
mod5 = np.random.rand(1000)
# each model independently predicts 1 (the "correct response") if its random number is greater than 0.3, which happens about 70% of the time
preds1 = np.where(mod1 > 0.3, 1, 0)
preds2 = np.where(mod2 > 0.3, 1, 0)
preds3 = np.where(mod3 > 0.3, 1, 0)
preds4 = np.where(mod4 > 0.3, 1, 0)
preds5 = np.where(mod5 > 0.3, 1, 0)
# print the first 20 predictions from each model
print(preds1[:20])
print(preds2[:20])
print(preds3[:20])
print(preds4[:20])
print(preds5[:20])
# average the predictions and then round to 0 or 1
ensemble_preds = np.round((preds1 + preds2 + preds3 + preds4 + preds5)/5.0).astype(int)
# print the ensemble's first 20 predictions
print(ensemble_preds[:20])
# how accurate was each individual model?
print(preds1.mean())
print(preds2.mean())
print(preds3.mean())
print(preds4.mean())
print(preds5.mean())
# how accurate was the ensemble?
print(ensemble_preds.mean())
# **Ensemble learning (or "ensembling")** is the process of combining several predictive models in order to produce a combined model that is more accurate than any individual model.
# - **Regression:** take the average of the predictions
# - **Classification:** take a vote and use the most common prediction, or take the average of the predicted probabilities
# For ensembling to work well, the models must have the following characteristics:
# - **Accurate:** they outperform random guessing
# - **Independent:** their predictions are generated using different processes
# **The big idea:** If you have a collection of individually imperfect (and independent) models, the "one-off" mistakes made by each model are probably not going to be made by the rest of the models, and thus the mistakes will be discarded when averaging the models.
# **Note:** As you add more models to the voting process, the probability of error decreases, which is known as [Condorcet's Jury Theorem](http://en.wikipedia.org/wiki/Condorcet%27s_jury_theorem).
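# A quick check of the numbers above (an added aside, assuming scipy is available): with five independent models that are each correct with probability 0.7, a majority vote is correct whenever at least three of the five are correct, which the binomial distribution puts at roughly 0.84.
from scipy.stats import binom
# P(at least 3 of 5 independent models are correct when each is correct with probability 0.7)
print(1 - binom.cdf(2, n=5, p=0.7))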
# ## Ensembling methods
# There are two basic methods for ensembling:
# - Use a model that ensembles for you
# - Manually ensemble your individual models (a short sketch follows below)
# What makes a good "manual ensemble"?
# - Different types of models
# - Different combinations of features
# - Different tuning parameters
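# A minimal sketch of a "manual ensemble" (an added illustration on synthetic data, not part of the original lesson): train two different model types and average their predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# synthetic classification problem, split into training and testing sets
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
# two different model types, trained on the same data
logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)
# average the predicted probabilities of class 1, then threshold at 0.5
avg_prob = (logreg.predict_proba(X_te)[:, 1] + forest.predict_proba(X_te)[:, 1]) / 2.0
print(accuracy_score(y_te, (avg_prob > 0.5).astype(int)))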
# # Part 2: Bagging
# The primary weakness of **decision trees** is that they don't tend to have the best predictive accuracy. This is partially due to **high variance**, meaning that different splits in the training data can lead to very different trees.
# **Bagging** is a general purpose procedure for reducing the variance of a machine learning method, but is particularly useful for decision trees. Bagging is short for **bootstrap aggregation**, meaning the aggregation of bootstrap samples.
# What is a **bootstrap sample**? A random sample with replacement:
# set a seed for reproducibility
np.random.seed(1)
# create an array of 1 through 20
nums = np.arange(1, 21)
print(nums)
# sample that array 20 times with replacement
print(np.random.choice(a=nums, size=20, replace=True))
# **How does bagging work (for decision trees)?**
# 1. Grow B trees using B bootstrap samples from the training data.
# 2. Train each tree on its bootstrap sample and make predictions.
# 3. Combine the predictions:
# - Average the predictions for **regression trees**
# - Take a majority vote for **classification trees**
# Notes:
# - **Each bootstrap sample** should be the same size as the original training set.
# - **B** should be a large enough value that the error seems to have "stabilized".
# - The trees are **grown deep** so that they have low bias/high variance.
# Bagging increases predictive accuracy by **reducing the variance**, similar to how cross-validation reduces the variance associated with train/test split (for estimating out-of-sample error) by splitting many times and averaging the results.
# ## Manually implementing bagged decision trees (with B=10)
# read in and prepare the vehicle training data
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT7/master/data/vehicles_train.csv'
train = pd.read_csv(url)
train['vtype'] = train.vtype.map({'car':0, 'truck':1})
train
# set a seed for reproducibility
np.random.seed(123)
# create ten bootstrap samples (will be used to select rows from the DataFrame)
samples = [np.random.choice(a=14, size=14, replace=True) for _ in range(1, 11)]
samples
# show the rows for the first decision tree
train.iloc[samples[0], :]
# read in and prepare the vehicle testing data
url = 'https://raw.githubusercontent.com/justmarkham/DAT7/master/data/vehicles_test.csv'
test = pd.read_csv(url)
test['vtype'] = test.vtype.map({'car':0, 'truck':1})
test
from sklearn.tree import DecisionTreeRegressor
# grow each tree deep
treereg = DecisionTreeRegressor(max_depth=None, random_state=123)
# list for storing predicted price from each tree
predictions = []
# define testing data
X_test = test.iloc[:, 1:]
y_test = test.iloc[:, 0]
# grow one tree for each bootstrap sample and make predictions on testing data
for sample in samples:
    X_train = train.iloc[sample, 1:]
    y_train = train.iloc[sample, 0]
    treereg.fit(X_train, y_train)
    y_pred = treereg.predict(X_test)
    predictions.append(y_pred)
# convert predictions from list to NumPy array
predictions = np.array(predictions)
predictions
# average predictions
np.mean(predictions, axis=0)
# calculate RMSE
from sklearn import metrics
y_pred = np.mean(predictions, axis=0)
np.sqrt(metrics.mean_squared_error(y_test, y_pred))
# ## Bagged decision trees in scikit-learn (with B=500)
# define the training and testing sets
X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]
X_test = test.iloc[:, 1:]
y_test = test.iloc[:, 0]
# instruct BaggingRegressor to use DecisionTreeRegressor as the "base estimator"
from sklearn.ensemble import BaggingRegressor
bagreg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=500, bootstrap=True, oob_score=True, random_state=1)
# fit and predict
bagreg.fit(X_train, y_train)
y_pred = bagreg.predict(X_test)
y_pred
# calculate RMSE
np.sqrt(metrics.mean_squared_error(y_test, y_pred))
# ## Estimating out-of-sample error
# For bagged models, out-of-sample error can be estimated without using **train/test split** or **cross-validation**!
# On average, each bagged tree uses about **two-thirds** of the observations. For each tree, the **remaining observations** are called "out-of-bag" observations.
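# A quick added aside: the expected fraction of distinct observations that appear in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) as n grows, hence "about two-thirds".
n = 1000
print(1 - (1 - 1.0/n)**n)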
# show the first bootstrap sample
samples[0]
# show the "in-bag" observations for each sample
for sample in samples:
    print(set(sample))
# show the "out-of-bag" observations for each sample
for sample in samples:
    print(sorted(set(range(14)) - set(sample)))
# How to calculate **"out-of-bag error"** (a manual sketch for the bagged trees above follows these steps):
# 1. For every observation in the training data, predict its response value using **only** the trees in which that observation was out-of-bag. Average those predictions (for regression) or take a majority vote (for classification).
# 2. Compare all predictions to the actual response values in order to compute the out-of-bag error.
# When B is sufficiently large, the **out-of-bag error** is an accurate estimate of **out-of-sample error**.
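# A minimal sketch of that procedure for the ten manually bagged trees above (an added illustration, not from the original lesson): predict each training observation using only the trees whose bootstrap sample excluded it, then compute the RMSE of those out-of-bag predictions.
oob_preds = np.full(14, np.nan)
for i in range(14):
    tree_preds = []
    for sample in samples:
        if i not in sample:
            # refit a deep tree on this bootstrap sample (same settings as above)
            oob_tree = DecisionTreeRegressor(max_depth=None, random_state=123)
            oob_tree.fit(train.iloc[sample, 1:], train.iloc[sample, 0])
            tree_preds.append(oob_tree.predict(train.iloc[[i], 1:])[0])
    if tree_preds:
        oob_preds[i] = np.mean(tree_preds)
# out-of-bag RMSE (skipping any observation that was in-bag for every tree)
in_oob = ~np.isnan(oob_preds)
print(np.sqrt(metrics.mean_squared_error(train.iloc[:, 0][in_oob], oob_preds[in_oob])))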
# compute the out-of-bag R-squared score (not MSE, unfortunately!) for B=500
bagreg.oob_score_
# ## Estimating feature importance
# Bagging increases **predictive accuracy**, but decreases **model interpretability** because it's no longer possible to visualize the tree to understand the importance of each feature.
# However, we can still obtain an overall summary of **feature importance** from bagged models (a hand-rolled example follows this list):
# - **Bagged regression trees:** calculate the total amount that **MSE** is decreased due to splits over a given feature, averaged over all trees
# - **Bagged classification trees:** calculate the total amount that **Gini index** is decreased due to splits over a given feature, averaged over all trees
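# A short added sketch: BaggingRegressor exposes its individual fitted trees via estimators_, so their feature importances can be averaged by hand (this averaging is the illustration here, not a BaggingRegressor built-in).
bagged_importances = np.mean([tree.feature_importances_ for tree in bagreg.estimators_], axis=0)
pd.DataFrame({'feature':X_train.columns, 'importance':bagged_importances}).sort_values('importance')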
# # Part 3: Random Forests
# Random Forests is a **slight variation of bagged trees** that has even better performance:
# - Exactly like bagging, we create an ensemble of decision trees using bootstrapped samples of the training set.
# - However, when building each tree, each time a split is considered, a **random sample of m features** is chosen as split candidates from the **full set of p features**. The split is only allowed to use **one of those m features**.
# - A new random sample of features is chosen for **every single tree at every single split**.
# - For **classification**, m is typically chosen to be the square root of p.
# - For **regression**, m is typically chosen to be somewhere between p/3 and p.
# What's the point?
# - Suppose there is **one very strong feature** in the data set. When using bagged trees, most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are **highly correlated**.
# - Averaging highly correlated quantities does not significantly reduce variance (which is the entire goal of bagging).
# - By randomly leaving out candidate features from each split, **Random Forests "decorrelates" the trees**, such that the averaging process can reduce the variance of the resulting model. (A short comparison sketch on the vehicle data from Part 2 follows.)
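# A small added sketch on the vehicle data from Part 2 (with only 14 training rows and 4 features, the difference may well be negligible here): max_features=None considers every feature at every split, which is essentially bagged trees, while a smaller max_features decorrelates the trees.
from sklearn.ensemble import RandomForestRegressor
for m in [None, 2]:
    rf = RandomForestRegressor(n_estimators=500, max_features=m, random_state=1)
    rf.fit(X_train, y_train)
    print(m, np.sqrt(metrics.mean_squared_error(y_test, rf.predict(X_test))))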
# # Part 4: Comparing Decision Trees and Random Forests
# ## Exploring and preparing the data
# read in the baseball salary data
url = 'https://raw.githubusercontent.com/justmarkham/DAT7/master/data/hitters.csv'
hitters = pd.read_csv(url)
hitters.head()
# show a cross-tabulation of League and NewLeague
pd.crosstab(hitters.League, hitters.NewLeague)
# check for missing values
hitters.isnull().sum()
# remove rows with missing values
hitters.dropna(inplace=True)
# factorize encodes categorical values as integers
pd.factorize(hitters.League)
# encode the categorical columns as integers
hitters['League'] = pd.factorize(hitters.League)[0]
hitters['Division'] = pd.factorize(hitters.Division)[0]
hitters['NewLeague'] = pd.factorize(hitters.NewLeague)[0]
hitters.head()
# histogram of Salary
hitters.Salary.plot(kind='hist')
# scatter plot of Years versus Hits colored by Salary
hitters.plot(kind='scatter', x='Years', y='Hits', c='Salary', colormap='jet', xlim=(0, 25), ylim=(0, 250))
# exclude columns which represent career statistics
feature_cols = hitters.columns[~hitters.columns.str.startswith('C')]
# exclude the response
feature_cols = feature_cols.drop('Salary')
# define X and y
X = hitters[feature_cols]
y = hitters.Salary
# ## Predicting salary with a decision tree
# Find the best **max_depth** for a decision tree using cross-validation:
# list of values to try for max_depth
max_depth_range = range(1, 21)
# list to store the average RMSE for each value of max_depth
RMSE_scores = []
# use 10-fold cross-validation with each value of max_depth
from sklearn.model_selection import cross_val_score
for depth in max_depth_range:
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
    MSE_scores = cross_val_score(treereg, X, y, cv=10, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))
# plot max_depth (x-axis) versus RMSE (y-axis)
import matplotlib.pyplot as plt
plt.plot(max_depth_range, RMSE_scores)
plt.xlabel('max_depth')
plt.ylabel('RMSE (lower is better)')
# show the best RMSE and the corresponding max_depth
sorted(zip(RMSE_scores, max_depth_range))[0]
# max_depth=2 was best, so fit a tree using that parameter
treereg = DecisionTreeRegressor(max_depth=2, random_state=1)
treereg.fit(X, y)
# compute feature importances
pd.DataFrame({'feature':feature_cols, 'importance':treereg.feature_importances_}).sort_values('importance')
# ## Predicting salary with a Random Forest
from sklearn.ensemble import RandomForestRegressor
rfreg = RandomForestRegressor()
rfreg
# One important tuning parameter is **n_estimators:** the number of trees that should be grown.
# list of values to try for n_estimators
estimator_range = range(10, 310, 10)
# list to store the average RMSE for each value of n_estimators
RMSE_scores = []
# use 5-fold cross-validation with each value of n_estimators
for estimator in estimator_range:
    rfreg = RandomForestRegressor(n_estimators=estimator, random_state=1)
    MSE_scores = cross_val_score(rfreg, X, y, cv=5, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))
# plot n_estimators (x-axis) versus RMSE (y-axis)
plt.plot(estimator_range, RMSE_scores)
plt.xlabel('n_estimators')
plt.ylabel('RMSE (lower is better)')
# **n_estimators** should be a large enough value that the error seems to have "stabilized".
# The other important tuning parameter is **max_features:** the number of features that should be considered at each split.
# list of values to try for max_features
feature_range = range(1, len(feature_cols)+1)
# list to store the average RMSE for each value of max_features
RMSE_scores = []
# use 10-fold cross-validation with each value of max_features
for feature in feature_range:
    rfreg = RandomForestRegressor(n_estimators=150, max_features=feature, random_state=1)
    MSE_scores = cross_val_score(rfreg, X, y, cv=10, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))
# plot max_features (x-axis) versus RMSE (y-axis)
plt.plot(feature_range, RMSE_scores)
plt.xlabel('max_features')
plt.ylabel('RMSE (lower is better)')
# show the best RMSE and the corresponding max_features
sorted(zip(RMSE_scores, feature_range))[0]
# max_features=8 was best, so fit a Random Forest using that parameter
rfreg = RandomForestRegressor(n_estimators=150, max_features=8, oob_score=True, random_state=1)
rfreg.fit(X, y)
# compute feature importances
pd.DataFrame({'feature':feature_cols, 'importance':rfreg.feature_importances_}).sort_values('importance')
# compute the out-of-bag R-squared score
rfreg.oob_score_
# ## Reduce X to its most important features
# check the shape of X
X.shape
# set a threshold for which features to include (SelectFromModel replaces the old
# rfreg.transform interface, which is no longer available in scikit-learn)
from sklearn.feature_selection import SelectFromModel
print(SelectFromModel(rfreg, threshold=0.1, prefit=True).transform(X).shape)
print(SelectFromModel(rfreg, threshold='mean', prefit=True).transform(X).shape)
print(SelectFromModel(rfreg, threshold='median', prefit=True).transform(X).shape)
# create a new feature matrix that only includes the important features
X_important = SelectFromModel(rfreg, threshold='mean', prefit=True).transform(X)
# check the RMSE for a Random Forest that only uses important features
rfreg = RandomForestRegressor(n_estimators=150, max_features=3, random_state=1)
scores = cross_val_score(rfreg, X_important, y, cv=10, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))
# # Part 5: Conclusion
# ## Comparing Random Forests with Decision Trees
# **Advantages of Random Forests:**
# - Performance is competitive with the best supervised learning methods
# - Provides a more reliable estimate of feature importance
# - Allows you to estimate out-of-sample error without using train/test split or cross-validation
# **Disadvantages of Random Forests:**
# - Less interpretable
# - Slower to train
# - Slower to predict
# ## Comparing "manual" ensembling with a single model approach
# **Advantages of ensembling:**
# - Increases predictive accuracy
# - Easy to get started
# **Disadvantages of ensembling:**
# - Decreases interpretability
# - Takes longer to train
# - Takes longer to predict
# - More complex to automate and maintain
# - Small gains in accuracy may not be worth the added complexity