Greetings to everyone who has made it to part five of our course!
The course has already brought together more than 1,000 participants, of whom 520, 450, and 360 respectively completed the first three homework assignments. About 200 participants still have the maximum score so far. The dropout rate is much lower than in typical MOOCs, even though our articles are long.
This session is devoted to simple methods of building compositions: bagging and random forest. You will learn how to obtain the distribution of the mean of a general population when we have information only about a small part of it. We'll look at how to reduce the variance, and thus improve the accuracy of a model, using compositions of algorithms. We'll also figure out what a random forest is, which of its parameters need to be "tuned", and how to find the most important features. We'll focus on practice, with just a "pinch" of mathematics.
1. [Exploratory data analysis with Pandas](https://habrahabr.ru/company/ods/blog/322626/)
2. [Visual data analysis with Python](https://habrahabr.ru/company/ods/blog/323210/)
3. [Classification, decision trees and the nearest neighbors method](https://habrahabr.ru/company/ods/blog/322534/)
4. [Linear models of classification and regression](https://habrahabr.ru/company/ods/blog/323890/)
5. [Compositions: bagging, random forest](https://habrahabr.ru/company/ods/blog/324402/)
6. [Feature engineering and feature selection. Applications to text, image and geodata processing](https://habrahabr.ru/company/ods/blog/325422/)
7. Unsupervised learning: PCA, clustering, anomaly detection

Outline of this article:

- Bagging
- Random forest
- Feature importance evaluation
- Pros and cons of random forest
- Homework #5
- Useful resources
#1. Bagging

From the last lectures, you learned about different classification algorithms, as well as how to validate and evaluate the quality of a model. But what if you have already found the best model and cannot improve its accuracy any further? In this case, you need to apply more advanced machine learning techniques, which can be jointly called "ensembles". An ensemble is a set whose parts form a whole. From everyday life you know musical ensembles, which combine several instruments, architectural ensembles of various buildings, etc.
A good example of ensembles is represented by Condorcet's jury theorem (1784). If each member of the jury has an independent opinion and the probability of correct decision of each juror is more than 0.5, then the probability of correct decision of the whole jury increases with the number of jurors and tends to one. At the same time, if the probability of being right is less than 0.5 for each juror, then the probability of correct decision by the jury decreases monotonically and tends to zero with an increasing number of jurors.
Let's look at another example of ensembles, the "Wisdom of the Crowd". In 1906, Francis Galton visited a market where a lottery was held for farmers. About 800 people gathered and tried to guess the weight of the bull standing in front of them. The bull weighed 1,198 pounds. None of the farmers guessed the exact weight, but the average of their predictions was 1,197 pounds. The same idea of error reduction was adopted in machine learning.
Bagging (from Bootstrap aggregation) is one of the first and simplest kinds of ensembles. It was proposed by Leo Breiman in 1994. Bagging is based on the bootstrap statistical method, which makes it possible to estimate many statistics of complex distributions.
The bootstrap method consists of the following. Suppose we have a sample $inline$ \large X$inline$ of size $inline$ \large N$inline$. Let's uniformly draw $inline$ \large N$inline$ objects from this sample with replacement: each time we pick an arbitrary object out of the same $inline$ \large N$inline$ objects, and any object may be drawn more than once. Repeating this procedure $inline$ \large M$inline$ times, we generate $inline$ \large M$inline$ bootstrap subsamples $inline$ \large X_1, \dots, X_M$inline$. Now, having a sufficiently large number of subsamples, we can estimate various statistics of the original distribution.
Let's take the telecom_churn dataset, already familiar to you from previous lessons of our course. Recall that this is a binary classification problem of customer churn. One of the most important features in this dataset is the number of calls to the service center made by the client. Let's try to visualize the data and look at the distribution of this feature.
<spoiler title="Code for plotting the distribution of the number of calls to the service center">
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

telecom_data = pd.read_csv('data/telecom_churn.csv')

fig = sns.kdeplot(telecom_data[telecom_data['Churn'] == False]['Customer service calls'], label='Loyal')
fig = sns.kdeplot(telecom_data[telecom_data['Churn'] == True]['Customer service calls'], label='Churn')
fig.set(xlabel='Number of calls', ylabel='Density')
plt.show()
```
</spoiler>
![image](https://github.com/Yorko/mlcourse_open/blob/master/img/bootstrap_new.png?raw=true)
As you can see, loyal customers make fewer calls to the service center than our former clients. Now it would be good to estimate the average number of calls made by each group. Since there is little data in the dataset, simply computing the mean is not quite right; it is better to apply our new knowledge of the bootstrap. Let's generate 1,000 new bootstrap subsamples and construct an interval estimate of the mean.
<spoiler title="Code for constructing a confidence interval using the bootstrap">
```python
import numpy as np
def get_bootstrap_samples(data, n_samples):
    # Function to generate subsamples using bootstrap
    indices = np.random.randint(0, len(data), (n_samples, len(data)))
    samples = data[indices]
    return samples

def stat_intervals(stat, alpha):
    # Function for interval estimation
    boundaries = np.percentile(stat, [100 * alpha / 2., 100 * (1 - alpha / 2.)])
    return boundaries
# Saving datasets on loyal and former clients to separate numpy arrays
loyal_calls = telecom_data[telecom_data['Churn'] == False]['Customer service calls'].values
churn_calls= telecom_data[telecom_data['Churn'] == True]['Customer service calls'].values
# Fix the seed for reproducibility
np.random.seed(0)
# Generate samples using bootstrap and immediately compute mean for each of them
loyal_mean_scores = [np.mean(sample)
                     for sample in get_bootstrap_samples(loyal_calls, 1000)]
churn_mean_scores = [np.mean(sample)
                     for sample in get_bootstrap_samples(churn_calls, 1000)]
# Print interval estimate of the mean
print("Service calls from loyal: mean interval", stat_intervals(loyal_mean_scores, 0.05))
print("Service calls from churn: mean interval", stat_intervals(churn_mean_scores, 0.05))
As a result, we have found that with 95% probability the average number of calls from loyal customers lies between 1.40 and 1.50, while our former clients called on average 2.06-2.40 times. It is also worth noting that the interval for loyal customers is narrower, which is quite logical: they call rarely (mostly 0, 1 or 2 times), while dissatisfied customers call more often, but over time their patience runs out and they change the operator.
Now that you have an idea of the bootstrap, we can proceed to bagging itself. Suppose we have a training set $inline$ \large X$inline$. Using bootstrap, we generate samples $inline$ \large X_1, \dots, X_M$inline$. Now, on each sample we train its own classifier $inline$ \large a_i(x)$inline$. The final classifier averages the responses of all these algorithms (in the case of classification, this corresponds to voting): $inline$ \large a(x) = \frac{1}{M}\sum_{i = 1}^{M} a_i(x)$inline$.

Let's consider a regression problem with base algorithms $inline$ \large b_1(x), \dots, b_n(x)$inline$. Suppose there exists a true target function $inline$ \large y(x)$inline$ defined for all objects, and a distribution $inline$ \large p(x)$inline$ on the objects. Then the error of each regression function can be written as $$display$$ \large \varepsilon_i(x) = b_i(x) - y(x), \quad i = 1, \dots, n.$$display$$

The average error of the constructed regression functions has the form $$display$$ \large E_1 = \frac{1}{n}\sum_{i=1}^{n} E_x\left[\varepsilon_i^2(x)\right].$$display$$

Suppose that the errors are unbiased and uncorrelated: $$display$$ \large E_x\left[\varepsilon_i(x)\right] = 0, \qquad E_x\left[\varepsilon_i(x)\varepsilon_j(x)\right] = 0, \quad i \neq j.$$display$$

Let's now construct a new regression function that averages the responses of the constructed functions: $$display$$ \large a(x) = \frac{1}{n}\sum_{i=1}^{n} b_i(x).$$display$$

Let's find its mean square error: $$display$$ \large E_n = E_x\left[\left(\frac{1}{n}\sum_{i=1}^{n} b_i(x) - y(x)\right)^2\right] = E_x\left[\left(\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i(x)\right)^2\right] = \frac{1}{n^2} E_x\left[\sum_{i=1}^{n}\varepsilon_i^2(x) + \sum_{i \neq j}\varepsilon_i(x)\varepsilon_j(x)\right] = \frac{1}{n} E_1.$$display$$

Thus, averaging the responses reduced the mean squared error by a factor of n!
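As a quick sanity check of this result (a small illustration we add here, not part of the original derivation), we can simulate n uncorrelated, unbiased errors and compare the MSE of a single algorithm with the MSE of their average:

```python
import numpy as np

np.random.seed(0)
n, trials = 10, 100000
# epsilon_i(x): uncorrelated, unbiased errors of n base algorithms
errors = np.random.normal(0, 1, size=(trials, n))

mse_single = np.mean(errors[:, 0] ** 2)           # E_1 for a single regressor
mse_averaged = np.mean(errors.mean(axis=1) ** 2)  # E_n for the average of n regressors

print(mse_single, mse_averaged, mse_single / mse_averaged)  # the ratio is close to n
```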
A reminder from our previous lesson about how the error decomposes: $$display$$ \large \text{Err}(x) = \text{Bias}^2(\hat f) + \text{Var}(\hat f) + \sigma^2.$$display$$
Bagging reduces the variance of the classifier by reducing the amount by which the error will differ if the model is trained on different sets of data, or in other words, prevents overfitting. Bagging efficiency is achieved due to the fact that the basic algorithms that are trained on different subsamples appear to be quite different, and their errors cancel each other out in voting. Also outliers will not get into some of the training sets.
The scikit-learn library has implementations of BaggingRegressor and BaggingClassifier, which allow you to use most other algorithms "inside". Let's look at how bagging works in practice and compare it with a decision tree, using an example from the documentation.
*(Figure: error of a single decision tree vs. error of bagging, from the scikit-learn documentation example.)*
The plot and the results above show that the variance of the error is much lower with bagging, just as we showed theoretically above.
Bagging is effective on small samples when exclusion of even a small part of training objects leads to the construction of significantly different base classifiers. In the case of large samples, it's common to generate significantly smaller subsamples.
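For instance, here is a minimal sketch (our own illustration, not from the original example) of how the subsample size can be limited in scikit-learn's BaggingClassifier via the max_samples parameter:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each base tree is trained on a bootstrap subsample of only 10% of the objects
bagging_small = BaggingClassifier(DecisionTreeClassifier(),
                                  n_estimators=100,
                                  max_samples=0.1,
                                  bootstrap=True,
                                  random_state=42)
```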
It should be noted that the above derivation cannot be applied in practice directly, because we assumed that the errors are uncorrelated, which is rarely the case. If this assumption is wrong, the error reduction is not as significant. In the following parts, we will look at more complex methods of combining algorithms into compositions, which allow achieving high quality in real problems.
Looking ahead, we note that when using random forests there is no need for cross-validation or a separate test set to get an unbiased estimate of the test error. Let's see how this "internal" estimate is obtained during model training.
Each tree is constructed using different bootstrap samples from the original data. Approximately 37% of the examples remain outside the bootstrap sample and are not used in the construction of the k-th tree.
It is easy to prove. Let the sample contain $inline$ \large \ell$inline$ objects. At each step, every object gets into a subsample with replacement with equal probability $inline$ \large \frac{1}{\ell}$inline$. The probability that a particular object does not get into the subsample (i.e., it was not drawn in any of the $inline$ \large \ell$inline$ draws) is $inline$ \large (1 - \frac{1}{\ell})^\ell$inline$. As $inline$ \large \ell \rightarrow +\infty$inline$, this tends to $inline$ \large e^{-1} \approx 0.37$inline$, so about 37% of the objects stay outside each bootstrap sample.
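A quick numerical check of this limit (a small sketch added for illustration):

```python
import numpy as np

# (1 - 1/l)^l quickly approaches 1/e ~ 0.368, i.e. ~37% of objects stay out of the bootstrap sample
for l in [10, 100, 10000]:
    print(l, (1 - 1 / l) ** l)
print("1/e =", 1 / np.e)
```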
Let's see how this works in practice:
The figure illustrates how the OOB error is estimated. The top image represents our original sample: we divide it into a training set (left) and a test set (right). On the image on the right, we draw a grid of squares that perfectly partitions our sample, and now we need to estimate the share of correct answers on the test part. The figure shows that our classifier was mistaken in 4 of the observations that were not used for training, so the share of correct answers is computed on exactly these held-out observations.
It turns out, that each basic algorithm is trained on ~ 63% of the original objects. So it can be checked right away on the remaining ~ 37%. Out-of-Bag estimate is the average score of the basic algorithms on these ~ 37% of the data on which they were not trained.
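In scikit-learn this estimate is available through the oob_score option; below is a minimal sketch on synthetic data (our own example, the dataset and parameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=42)

rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=42, n_jobs=-1)
rf.fit(X_demo, y_demo)

# Accuracy estimated on the ~37% of objects each tree never saw during training
print("OOB accuracy estimate: {:.3f}".format(rf.oob_score_))
```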
#2. Random Forest

Leo Breiman found use for the bootstrap not only in statistics but also in machine learning. Together with Adele Cutler, he improved the random forest algorithm proposed by Ho, adding to the original version the construction of uncorrelated trees based on CART, combined with the random subspace method and bagging.
Decision trees are a good family of base classifiers for bagging because they are quite complex and can achieve zero error on any sample. The random subspace method reduces the correlation between the trees and helps avoid overfitting: the base algorithms are trained on different random subsets of the feature space. An ensemble of models using the random subspace method can be constructed with the following algorithm:
- Let the number of training objects be $inline$ \large N$inline$, and the number of features be $inline$ \large D$inline$.
- Select $inline$ \large L$inline$ as the number of individual models in the ensemble.
- For each model $inline$ \large l$inline$, select $inline$ \large d_l\ (d_l < D)$inline$ as the number of features for $inline$ \large l$inline$. Usually, a single value of $inline$ \large d_l$inline$ is used for all models.
- For each model $inline$ \large l$inline$, create a training set by selecting $inline$ \large d_l$inline$ features at random from $inline$ \large D$inline$, and train the model.
- To apply the ensemble to a new object, combine the results of the individual $inline$ \large L$inline$ models by majority voting or by combining the posterior probabilities.
The algorithm for constructing a random forest of $inline$ \large N$inline$ trees is as follows:

- For each $inline$ \large n = 1, \dots, N$inline$:
    - Generate a sample $inline$ \large X_n$inline$ via bootstrap;
    - Build a decision tree $inline$ \large b_n$inline$ on the sample $inline$ \large X_n$inline$:
        — choose the best feature according to a given criterion, split the tree on it, and repeat until the sample is exhausted;
        — the tree is built until each leaf contains no more than $inline$ \large n_\text{min}$inline$ objects, or until a certain tree height is reached;
        — for each split, $inline$ \large m$inline$ features are first selected at random from the $inline$ \large n$inline$ initial features, and the optimal split is searched only among them.

The final classifier is $inline$ \large a(x) = \frac{1}{N}\sum_{i = 1}^{N} b_i(x)$inline$: in classification problems the decision is made by majority voting, and in regression by averaging.

It is recommended to take $inline$ \large m = \sqrt{n}$inline$ in classification problems and $inline$ \large m = \frac{n}{3}$inline$ in regression problems, where $inline$ \large n$inline$ is the number of features. It is also recommended to build each tree in classification until each leaf contains one object, and in regression until each leaf contains five objects.
Thus, a random forest is bagging over decision trees in which, at each split, the features are chosen from a random subset of features (a minimal sketch of this idea in scikit-learn terms is shown right below).
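The sketch below (an illustration of ours, not the course code) emulates this idea by bagging over decision trees whose max_features parameter limits the features considered at each split:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging over trees that look at a random sqrt(n)-sized subset of features at every split
almost_random_forest = BaggingClassifier(DecisionTreeClassifier(max_features='sqrt'),
                                         n_estimators=100,
                                         random_state=42)
```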
<spoiler title="Code for comparing a decision tree, bagging and random forest on a regression problem">
```python
from __future__ import division, print_function
# Disable any warnings from Anaconda
import warnings
warnings.filterwarnings('ignore')
%pylab inline
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

np.random.seed(42)
figsize(8, 6)

n_train = 150
n_test = 1000
noise = 0.1

def f(x):
    x = x.ravel()
    return np.exp(-x ** 2) + 1.5 * np.exp(-(x - 2) ** 2)

def generate(n_samples, noise):
    X = np.random.rand(n_samples) * 10 - 5
    X = np.sort(X).ravel()
    y = np.exp(-X ** 2) + 1.5 * np.exp(-(X - 2) ** 2) \
        + np.random.normal(0.0, noise, n_samples)
    X = X.reshape((n_samples, 1))
    return X, y

X_train, y_train = generate(n_samples=n_train, noise=noise)
X_test, y_test = generate(n_samples=n_test, noise=noise)

# Single decision tree
dtree = DecisionTreeRegressor().fit(X_train, y_train)
d_predict = dtree.predict(X_test)

plt.figure(figsize=(10, 6))
plt.plot(X_test, f(X_test), "b")
plt.scatter(X_train, y_train, c="b", s=20)
plt.plot(X_test, d_predict, "g", lw=2)
plt.xlim([-5, 5])
plt.title("Decision tree, MSE = %.2f" % np.sum((y_test - d_predict) ** 2))

# Bagging over decision trees
bdt = BaggingRegressor(DecisionTreeRegressor()).fit(X_train, y_train)
bdt_predict = bdt.predict(X_test)

plt.figure(figsize=(10, 6))
plt.plot(X_test, f(X_test), "b")
plt.scatter(X_train, y_train, c="b", s=20)
plt.plot(X_test, bdt_predict, "y", lw=2)
plt.xlim([-5, 5])
plt.title("Bagging of decision trees, MSE = %.2f" % np.sum((y_test - bdt_predict) ** 2));

# Random forest
rf = RandomForestRegressor(n_estimators=10).fit(X_train, y_train)
rf_predict = rf.predict(X_test)

plt.figure(figsize=(10, 6))
plt.plot(X_test, f(X_test), "b")
plt.scatter(X_train, y_train, c="b", s=20)
plt.plot(X_test, rf_predict, "r", lw=2)
plt.xlim([-5, 5])
plt.title("Random forest, MSE = %.2f" % np.sum((y_test - rf_predict) ** 2));
```
</spoiler>
![image](https://github.com/Yorko/mlcourse_open/blob/master/img/tree_mse.png?raw=true)
![image](https://github.com/Yorko/mlcourse_open/blob/master/img/bagging_mse.png?raw=true)
![image](https://github.com/Yorko/mlcourse_open/blob/master/img/rf_mse.png?raw=true)
As we can see from the plots and the MSE values, a random forest of 10 trees gives a better result than a single tree or bagging over 10 decision trees. The main difference between a random forest and bagging over decision trees is that the random forest randomly selects a subset of features, and the best feature for splitting a node is chosen from this subset, unlike bagging, where all features are considered when splitting a node.
You can also see the benefit of random forest and bagging in classification problems.
<spoiler title="Code for comparing the decision tree, bagging and random forest for classification task">
```python
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
import numpy as np
from matplotlib import pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = 10, 6
%matplotlib inline
np.random.seed(42)
X, y = make_circles(n_samples=500, factor=0.1, noise=0.35, random_state=42)
X_train_circles, X_test_circles, y_train_circles, y_test_circles = train_test_split(X, y, test_size=0.2)
dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_train_circles, y_train_circles)
x_range = np.linspace(X.min(), X.max(), 100)
xx1, xx2 = np.meshgrid(x_range, x_range)
y_hat = dtree.predict(np.c_[xx1.ravel(), xx2.ravel()])
y_hat = y_hat.reshape(xx1.shape)
plt.contourf(xx1, xx2, y_hat, alpha=0.2)
plt.scatter(X[:,0], X[:,1], c=y, cmap='autumn')
plt.title("Дерево решений")
plt.show()
b_dtree = BaggingClassifier(DecisionTreeClassifier(),n_estimators=300, random_state=42)
b_dtree.fit(X_train_circles, y_train_circles)
x_range = np.linspace(X.min(), X.max(), 100)
xx1, xx2 = np.meshgrid(x_range, x_range)
y_hat = b_dtree.predict(np.c_[xx1.ravel(), xx2.ravel()])
y_hat = y_hat.reshape(xx1.shape)
plt.contourf(xx1, xx2, y_hat, alpha=0.2)
plt.scatter(X[:,0], X[:,1], c=y, cmap='autumn')
plt.title("Бэггинг(дерево решений)")
plt.show()
rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train_circles, y_train_circles)
x_range = np.linspace(X.min(), X.max(), 100)
xx1, xx2 = np.meshgrid(x_range, x_range)
y_hat = rf.predict(np.c_[xx1.ravel(), xx2.ravel()])
y_hat = y_hat.reshape(xx1.shape)
plt.contourf(xx1, xx2, y_hat, alpha=0.2)
plt.scatter(X[:,0], X[:,1], c=y, cmap='autumn')
plt.title("Random forest")
plt.show()
```
</spoiler>
The figure above shows that the decision boundary of the decision tree is very "jagged" and has a lot of sharp angles, which indicates overfitting and weak generalization ability. Meanwhile, bagging and random forest have fairly smooth boundaries with few signs of overfitting.
Let us now figure out which parameters to tune in order to increase the share of correct answers.
The random forest method is implemented in the scikit-learn machine learning library by two classes: RandomForestClassifier and RandomForestRegressor.
Full list of parameters for random forest regression problem:
class sklearn.ensemble.RandomForestRegressor(
n_estimators — the number of trees in the "forest" (by default - 10)
criterion — a function that measures the quality of split of a tree branch (by default - "mse", also "mae" can be selected)
max_features — the number of features on which a partition is sought. You can specify a certain number or percentage of features, or choose from the available values: "auto" (all features), "sqrt", "log2". Default value is "auto".
max_depth — the maximum depth of the tree (by default depth is not limited)
min_samples_split — minimum number of objects required for the split of the internal node. You can specify the number or percentage of the total number of objects (by default it's 2)
min_samples_leaf — the minimum number of objects in a leaf. Can be defined by the number or percentage of the total number of objects (by default it's 1)
min_weight_fraction_leaf — minimum weighted proportion of the total sum of weights (of all input objects) that should be in a leaf (by default, all have the same weight)
max_leaf_nodes — the maximum number of leaves (by default there is no limit)
min_impurity_split — a threshold for stopping the construction of the tree (by default 1e-7)
bootstrap — whether to apply bootstrap to build a tree (by default True)
oob_score — whether to use the out-of-bag objects for evaluation of R^2 (by default False)
n_jobs — the number of cores to build the model and predictions (by default 1, if you put -1, then all cores will be used)
random_state — initial value for random number generation (by default there's none, if you want reproducible results, it is necessary to specify any number of int type)
verbose — log output of tree construction (by default 0)
warm_start — uses a pretrained model and adds the trees to the ensemble (by default False)
)
For classification problem it's almost all the same, we'll present only those parameters that are different from RandomForestRegressor
class sklearn.ensemble.RandomForestClassifier(
criterion — as we now have classification problem, by default "gini" is selected (also, "entropy" can be selected)
class_weight — the weight of each class (by default all weights are equal to 1, but you can pass a dictionary with weights, or explicitly specify "balanced" to make the class weights equal to their proportions in the general population; you can also specify "balanced_subsample", and then the weights on each subsample will vary depending on the distribution of classes in that subsample)
)
Next, let's consider several parameters to look at in the first place when building a model:
- n_estimators — the number of trees in the "forest"
- criterion — the criterion for splitting the sample at the node
- max_features — the number of features on which a partition is sought
- min_samples_leaf — the minimum number of objects in a leaf
- max_depth — the maximum depth of the tree
Let's consider the use of a random forest in a real problem
To do this, we will use the example of the problem of customer churn. This is a classification problem, so we will use the accuracy metric to assess the quality of the model. To start with, let's build the simplest classifier which will be our baseline. Let's take only numerical features for simplicity.
<spoiler title="Code for building a baseline">
```python
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score

df = pd.read_csv("../../data/telecom_churn.csv")

# Take only numerical features
cols = []
for i in df.columns:
    if (df[i].dtype == "float64") or (df[i].dtype == 'int64'):
        cols.append(i)

X, y = df[cols].copy(), np.asarray(df["Churn"], dtype='int8')

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rfc = RandomForestClassifier(random_state=42, n_jobs=-1, oob_score=True)
results = cross_val_score(rfc, X, y, cv=skf)
print("CV accuracy score: {:.2f}%".format(results.mean() * 100))
```
</spoiler>
We've obtained 91.21% correct answers; now we will try to improve this result and look at how the validation curves behave when we change the basic parameters.
Let's start with the number of trees:
<spoiler title="Code for constructing the validation curves on selecting the number of trees">
```python
# Initialize validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Create lists to save accuracy on training and test sets
train_acc = []
test_acc = []
temp_train_acc = []
temp_test_acc = []
trees_grid = [5, 10, 15, 20, 30, 50, 75, 100]
# Train on the training set
for ntrees in trees_grid:
    rfc = RandomForestClassifier(n_estimators=ntrees, random_state=42, n_jobs=-1, oob_score=True)
    temp_train_acc = []
    temp_test_acc = []
    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y[train_index], y[test_index]
        rfc.fit(X_train, y_train)
        temp_train_acc.append(rfc.score(X_train, y_train))
        temp_test_acc.append(rfc.score(X_test, y_test))
    train_acc.append(temp_train_acc)
    test_acc.append(temp_test_acc)
train_acc, test_acc = np.asarray(train_acc), np.asarray(test_acc)
print("Best accuracy on CV is {:.2f}% with {} trees".format(max(test_acc.mean(axis=1))*100,
trees_grid[np.argmax(test_acc.mean(axis=1))]))
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(trees_grid, train_acc.mean(axis=1), alpha=0.5, color='blue', label='train')
ax.plot(trees_grid, test_acc.mean(axis=1), alpha=0.5, color='red', label='cv')
ax.fill_between(trees_grid, test_acc.mean(axis=1) - test_acc.std(axis=1),
                test_acc.mean(axis=1) + test_acc.std(axis=1), color='#888888', alpha=0.4)
ax.fill_between(trees_grid, test_acc.mean(axis=1) - 2 * test_acc.std(axis=1),
                test_acc.mean(axis=1) + 2 * test_acc.std(axis=1), color='#888888', alpha=0.2)
ax.legend(loc='best')
ax.set_ylim([0.88, 1.02])
ax.set_ylabel("Accuracy")
ax.set_xlabel("N_estimators")
```
</spoiler>
![image](https://github.com/Yorko/mlcourse_open/blob/master/img/rf_n_est.png?raw=true)
As you can see, once a certain number of trees is reached, the share of correct answers on the test set approaches an asymptote; you can decide for yourself how many trees are optimal for your problem.
The figure also shows that we achieved 100% accuracy on the training set, which indicates overfitting. To avoid overfitting, we need to add regularization parameters to the model.
Let's start with the maximum depth parameter, max_depth (we fix the number of trees at 100).
<spoiler title="Code for building learning curves on the selection of the maximum depth of the tree">
```python
# Create lists to save accuracy on training and test sets
train_acc = []
test_acc = []
temp_train_acc = []
temp_test_acc = []
max_depth_grid = [3, 5, 7, 9, 11, 13, 15, 17, 20, 22, 24]
# Train on the training set
for max_depth in max_depth_grid:
    rfc = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1, oob_score=True, max_depth=max_depth)
    temp_train_acc = []
    temp_test_acc = []
    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y[train_index], y[test_index]
        rfc.fit(X_train, y_train)
        temp_train_acc.append(rfc.score(X_train, y_train))
        temp_test_acc.append(rfc.score(X_test, y_test))
    train_acc.append(temp_train_acc)
    test_acc.append(temp_test_acc)
train_acc, test_acc = np.asarray(train_acc), np.asarray(test_acc)
print("Best accuracy on CV is {:.2f}% with {} max_depth".format(max(test_acc.mean(axis=1))*100,
                                                             max_depth_grid[np.argmax(test_acc.mean(axis=1))]))
```
</spoiler>
The max_depth parameter handles regularization of the model well, and we do not overfit as much this time. The share of correct answers of our model has slightly increased.
Another important parameter is min_samples_leaf; it also performs the function of a regularizer.
<spoiler title="Code for constructing validation curves on selecting min_samples_leaf">
```python
# Create lists to save accuracy on training and test sets
train_acc = []
test_acc = []
temp_train_acc = []
temp_test_acc = []
# Grid of values to try (the exact grid from the original run was not preserved; these values are illustrative)
min_samples_leaf_grid = [1, 3, 5, 7, 9, 11, 13, 15, 17, 20, 22, 24]
# Train on the training set
for min_samples_leaf in min_samples_leaf_grid:
    rfc = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1,
                                 oob_score=True, min_samples_leaf=min_samples_leaf)
    temp_train_acc = []
    temp_test_acc = []
    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y[train_index], y[test_index]
        rfc.fit(X_train, y_train)
        temp_train_acc.append(rfc.score(X_train, y_train))
        temp_test_acc.append(rfc.score(X_test, y_test))
    train_acc.append(temp_train_acc)
    test_acc.append(temp_test_acc)

train_acc, test_acc = np.asarray(train_acc), np.asarray(test_acc)
print("Best accuracy on CV is {:.2f}% with {} min_samples_leaf".format(max(test_acc.mean(axis=1)) * 100,
                                                                       min_samples_leaf_grid[np.argmax(test_acc.mean(axis=1))]))
```
</spoiler>
<spoiler title="Code for plotting validation curves">
```python
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(min_samples_leaf_grid, train_acc.mean(axis=1), alpha=0.5, color='blue', label='train')
ax.plot(min_samples_leaf_grid, test_acc.mean(axis=1), alpha=0.5, color='red', label='cv')
ax.fill_between(min_samples_leaf_grid, test_acc.mean(axis=1) - test_acc.std(axis=1), test_acc.mean(axis=1) + test_acc.std(axis=1), color='#888888', alpha=0.4)
ax.fill_between(min_samples_leaf_grid, test_acc.mean(axis=1) - 2*test_acc.std(axis=1), test_acc.mean(axis=1) + 2*test_acc.std(axis=1), color='#888888', alpha=0.2)
ax.legend(loc='best')
ax.set_ylim([0.88,1.02])
ax.set_ylabel("Accuracy")
ax.set_xlabel("Min_samples_leaf")
In this case, we do not gain accuracy on validation, but we can greatly reduce overfitting, down to about a 2% gap, while keeping the accuracy at about 92%.
Let us consider the max_features parameter. For classification problems, the usual recommendation is $inline$ \large \sqrt{d}$inline$ features, and for regression $inline$ \large \frac{d}{3}$inline$, where $inline$ \large d$inline$ is the number of features. Let's see whether a different number of features is optimal in our problem.
<spoiler title="Code for constructing validation curves on selecting max_features">
```python
# Create lists to save accuracy on training and test sets
train_acc = []
test_acc = []
temp_train_acc = []
temp_test_acc = []
# Grid of values to try (the exact grid from the original run was not preserved; these values are illustrative)
max_features_grid = [2, 4, 6, 8, 10, 12, 14, 16]
# Train on the training set
for max_features in max_features_grid:
    rfc = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1,
                                 oob_score=True, max_features=max_features)
    temp_train_acc = []
    temp_test_acc = []
    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y[train_index], y[test_index]
        rfc.fit(X_train, y_train)
        temp_train_acc.append(rfc.score(X_train, y_train))
        temp_test_acc.append(rfc.score(X_test, y_test))
    train_acc.append(temp_train_acc)
    test_acc.append(temp_test_acc)

train_acc, test_acc = np.asarray(train_acc), np.asarray(test_acc)
print("Best accuracy on CV is {:.2f}% with {} max_features".format(max(test_acc.mean(axis=1)) * 100,
                                                                   max_features_grid[np.argmax(test_acc.mean(axis=1))]))
```
</spoiler>
<spoiler title="Code for plotting validation curves">
```python
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(max_features_grid, train_acc.mean(axis=1), alpha=0.5, color='blue', label='train')
ax.plot(max_features_grid, test_acc.mean(axis=1), alpha=0.5, color='red', label='cv')
ax.fill_between(max_features_grid, test_acc.mean(axis=1) - test_acc.std(axis=1), test_acc.mean(axis=1) + test_acc.std(axis=1), color='#888888', alpha=0.4)
ax.fill_between(max_features_grid, test_acc.mean(axis=1) - 2*test_acc.std(axis=1), test_acc.mean(axis=1) + 2*test_acc.std(axis=1), color='#888888', alpha=0.2)
ax.legend(loc='best')
ax.set_ylim([0.88,1.02])
ax.set_ylabel("Accuracy")
ax.set_xlabel("Max_features")
In our case, the optimal number of features is 10; this is where the best result is achieved.
We have examined how the validation curves behave as the basic parameters change. Let's now use GridSearchCV to find the optimal parameters for our example.
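A minimal sketch of such a search (the grid values below are our illustrative assumptions, not the course's original settings; it reuses X, y, skf and the imports from the baseline above):

```python
# Grid of hyperparameters to search over (illustrative values)
parameters = {'max_features': [4, 7, 10, 13],
              'min_samples_leaf': [1, 3, 5, 7],
              'max_depth': [5, 10, 15, 20]}

rfc = RandomForestClassifier(n_estimators=100, random_state=42,
                             n_jobs=-1, oob_score=True)
gcv = GridSearchCV(rfc, parameters, n_jobs=-1, cv=skf, verbose=1)
gcv.fit(X, y)

print(gcv.best_params_, gcv.best_score_)
```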
Let's write the variance of a random forest as $$display$$ \large \mathrm{Var}\, f(x) = \rho(x)\sigma^2(x),$$display$$ where

- $inline$ \large \rho(x)$inline$ is the sample correlation between any two trees used in the averaging: $$display$$ \large \rho(x) = corr[T(x;\Theta_1(Z)),T(x;\Theta_2(Z))],$$display$$ where $inline$ \large \Theta_1(Z)$inline$ and $inline$ \large \Theta_2(Z)$inline$ are a randomly selected pair of trees built on randomly selected objects of the sample $inline$ \large Z$inline$;
- $inline$ \large \sigma^2(x)$inline$ is the sample variance of any randomly selected tree: $$display$$ \large \sigma^2(x) = \mathrm{Var}\left[T(x;\Theta(Z))\right].$$display$$
It is easy to confuse $inline$ \large \rho(x)$inline$ with the average correlation between the trees of an already trained forest, considered as vectors of predictions; this is not the same thing. $inline$ \large \rho(x)$inline$ is the theoretical correlation between a pair of random trees evaluated at the point $inline$ \large x$inline$, induced by repeatedly drawing a training sample from the general population and then drawing a random pair of trees.
In fact, the conditional covariance of pairs of trees is equal to 0 because bootstrap and feature selection are independent and identically distributed.
If we consider the variance of a single tree, it practically does not change with the number of variables used for the split ($inline$ \large m$inline$); what decreasing $inline$ \large m$inline$ does reduce is the correlation $inline$ \large \rho(x)$inline$ between the trees, and with it the variance of the whole forest.
As in bagging, bias in a random forest is the same as bias in a single tree
Extremely Randomized Trees introduce more randomness in the way splits are computed in the nodes. As in random forests, a random subset of candidate features is used, but instead of searching for the optimal threshold, threshold values are drawn at random for each candidate feature, and the best of these randomly generated thresholds is chosen as the splitting rule. This usually allows a slight reduction of the model's variance at the cost of a slightly larger increase in bias.
The scikit-learn library contains the implementations ExtraTreesClassifier and ExtraTreesRegressor. This method is worth trying when you strongly overfit with a random forest or gradient boosting.
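A minimal sketch (our own example, reusing the generate() function defined in the regression comparison above): ExtraTreesRegressor is a drop-in replacement for RandomForestRegressor.

```python
from sklearn.ensemble import ExtraTreesRegressor

X_tr, y_tr = generate(n_samples=150, noise=0.1)
X_te, y_te = generate(n_samples=1000, noise=0.1)

# Same interface as RandomForestRegressor, but with random split thresholds
etr = ExtraTreesRegressor(n_estimators=10, random_state=42).fit(X_tr, y_tr)
print("Extra-trees MSE = %.2f" % np.sum((y_te - etr.predict(X_te)) ** 2))
```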
The random forest method is similar to the nearest neighbors method. Random forests, in fact, make predictions for an object based on the labels of similar objects from the training set; and the more often these objects appear in the same leaf, the more similar they are. Let's show this formally.
Let's consider a regression problem with a quadratic loss function. Let $inline$ \large T_n(x)$inline$ be the number of the leaf of the $inline$ \large n$inline$-th tree of the random forest into which the object $inline$ \large x$inline$ falls. The response of the tree on $inline$ \large x$inline$ equals the average response over all training objects that fell into this leaf: $$display$$ \large b_n(x) = \sum_{i=1}^{\ell} w_n(x, x_i) y_i,$$display$$ where $$display$$ \large w_n(x, x_i) = \frac{[T_n(x) = T_n(x_i)]}{\sum_{j=1}^{\ell} [T_n(x) = T_n(x_j)]}.$$display$$ Then the response of the composition is $$display$$ \large a(x) = \frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{\ell} w_n(x, x_i) y_i = \sum_{i=1}^{\ell}\left(\frac{1}{N}\sum_{n=1}^{N} w_n(x, x_i)\right) y_i.$$display$$ One can see that the response of the random forest is a weighted sum of the responses of all training objects, exactly as in the nearest neighbors method.
The book The Elements of Statistical Learning has a good illustrative example of the similarity between random forests and the k-nearest neighbors method.
Everyone is used to applying random forests for supervised learning problems, but there is also a way to train the model without a teacher. Using the RandomTreesEmbedding method, we can transform our dataset into a sparse high-dimensional representation. The essence is that we build completely random trees, and the index of the leaf into which an observation falls is treated as a new feature. If an object falls into the first leaf, we put 1; if not, 0 — a so-called binary encoding. We can control the number of variables and the degree of sparseness of the new representation by increasing/decreasing the number of trees and their depth. Since neighboring data points are likely to fall into the same leaf, this transformation performs an implicit nonparametric density estimation.
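A minimal sketch of such a transformation (our own toy example):

```python
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.datasets import make_circles

X_toy, y_toy = make_circles(n_samples=500, factor=0.1, noise=0.35, random_state=42)

# Each object becomes a sparse binary vector of "which leaf did it land in" indicators
rte = RandomTreesEmbedding(n_estimators=50, max_depth=3, random_state=42)
X_sparse = rte.fit_transform(X_toy)

print(X_sparse.shape)  # (500, total number of leaves over all trees)
```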
#3. Evaluation of the Feature Importance
Quite often you want to understand your algorithm: why it gave this particular answer and not another. Or, if not understand it completely, then at least find out which of the variables have the greatest impact on the result. This information can be obtained from a random forest quite easily.
It is intuitively clear, for example, that in a credit scoring problem the importance of the "Age" feature is greater than the importance of the "Income" feature. This is formalized with the help of the information gain concept.
If you build many decision trees (a random forest), then the higher a feature appears in the trees on average, the more important it is for the given classification/regression problem. At each split in each tree, the improvement in the split criterion (in our case, Gini impurity) is the importance measure associated with the splitting feature, and it is accumulated over all trees of the forest separately for each feature.
Let's get into the details. The mean decrease in accuracy caused by a feature is determined during the computation of the out-of-bag error: the more the prediction accuracy drops when a feature is excluded (or its values are permuted), the more important this feature is, and therefore features with a larger mean decrease in accuracy are more important for classification. The mean decrease in Gini impurity (or MSE in regression) measures how much each feature contributes to the homogeneity of nodes and leaves in the final random forest model. Every time a feature is used to split a node, the Gini impurity of the child nodes is computed and compared with that of the original node. Gini impurity is a measure of heterogeneity ranging from 0 (homogeneous) to 1 (heterogeneous). The changes in the value of the splitting criterion are summed for each feature and normalized at the end of the computation. Features that lead to nodes with higher purity have a larger decrease in Gini impurity.
Now let's express the above in formulas. The importance of feature $inline$ \large i$inline$ in an ensemble of $inline$ \large J$inline$ trees is computed as follows:

— unnormalized: the decrease of the splitting criterion attributed to feature $inline$ \large i$inline$, accumulated within each tree and averaged over all trees, $$display$$ \large \hat{f}_i = \frac{1}{J}\sum_{j=1}^{J} \hat{f}_{ij};$$display$$

— normalized: divided by the total importance of all features, so that the importances sum to one, $$display$$ \large f_i = \frac{\hat{f}_i}{\sum_{k} \hat{f}_{k}}.$$display$$
Example
Let's consider the results of a survey of hostel visitors from Booking.com and TripAdvisor.com. The features here are the average grades on various factors (listed below), such as staff, room condition, etc. The target variable is the hostel's rating on the site.
<spoiler title="Code for training a random forest and plotting the feature importances">
```python
from __future__ import division, print_function
# Disable any warnings from Anaconda
import warnings
warnings.filterwarnings('ignore')
%pylab inline
import seaborn as sns
# Russian headers
from matplotlib import rc
font = {'family': 'Verdana', 'weight': 'normal'}
rc('font', **font)
import pandas as pd
import numpy as np
from sklearn.ensemble.forest import RandomForestRegressor

hostel_data = pd.read_csv("../../data/hostel_factors.csv")
features = {"f1": u"Staff",
            "f2": u"Reservation of the Hostel",
            "f3": u"Check-in and check-out of the hostel",
            "f4": u"Room Condition",
            "f5": u"Condition of the shared kitchen",
            "f6": u"Condition of the common space",
            "f7": u"Extras",
            "f8": u"General terms and convenience",
            "f9": u"price/quality",
            "f10": u"SCC"}

forest = RandomForestRegressor(n_estimators=1000, max_features=10, random_state=0)

forest.fit(hostel_data.drop(['hostel', 'rating'], axis=1), hostel_data['rating'])
importances = forest.feature_importances_

indices = np.argsort(importances)[::-1]

num_to_plot = 10
feature_indices = [ind + 1 for ind in indices[:num_to_plot]]

print("Feature ranking:")
for f in range(num_to_plot):
    print("%d. %s %f " % (f + 1, features["f" + str(feature_indices[f])],
                          importances[indices[f]]))

plt.figure(figsize=(15, 5))
plt.title(u"The importance of the constructs")
bars = plt.bar(range(num_to_plot), importances[indices[:num_to_plot]],
               color=([str(i / float(num_to_plot + 1)) for i in range(num_to_plot)]),
               align="center")
ticks = plt.xticks(range(num_to_plot), feature_indices)
plt.xlim([-1, num_to_plot])
plt.legend(bars, [u''.join(features["f" + str(i)]) for i in feature_indices]);
```
</spoiler>
![image](https://github.com/Yorko/mlcourse_open/blob/master/img/rf_features_imp.png?raw=true)
The figure above shows that people are more likely to pay attention to the staff and the price/quality ratio, and write their reviews based on these factors. However, the difference between these features and the less important ones is not very significant, and discarding any feature would reduce the accuracy of our model. Still, based on our analysis, we can recommend that hostels focus first of all on staff training and/or on improving the declared price-to-quality ratio.
#4. Pros and Cons of Random Forest
**Pros**:
— it has a high prediction accuracy, and for most tasks it will perform better than linear algorithms; accuracy is comparable to the accuracy of boosting
— almost insensitive to outliers in the data due to random sampling
— insensitive to scaling (and, in general, to any monotonic transformations) of feature values, which is a consequence of choosing random subspaces
— does not require careful parameter tuning, works well "out of the box." By "tuning" parameters one can achieve a gain of 0.5 to 3% in accuracy, depending on the task and data
— it is able to efficiently process data with a large number of features and classes
— equally well processes both continuous and discrete variables
— rarely overfits; in practice, adding trees almost always only improves the composition, although on validation, after a certain number of trees, the learning curve approaches an asymptote
— for random forest, there exist methods for evaluating the significance of individual features in the model
— works well with missing data; it maintains good accuracy even if most of the data is missing
— implies the possibility to balance the weight of each class either on the whole sample, or on the subsample for each tree
— calculates the proximity between pairs of objects that can be used for clustering, anomaly detection or (by scaling) provide an interesting representation of the data
— the properties described above can be extended to unlabeled data, which enables data visualization, clustering, and anomaly detection
— high parallelization and scalability.
**Cons**:
— as opposed to a single tree, the results of random forest are difficult to interpret
— no formal conclusions (p-values) are available to assess the importance of the variables
— algorithm is worse than many linear methods when the sample has many sparse features (texts, Bag of words)
— random forest cannot extrapolate, in contrast to linear regression (but this can also be considered a plus, since extreme values are not produced in the case of an outlier)
— the algorithm tends to overfit in some problems, especially with noisy data
— for data including categorical variables with different number of levels, random forests are biased in favor of features with many levels: when a feature has a lot of levels, the tree will try harder to adapt to this feature, because it can give a higher value of the optimized functional (like information gain)
— if the data contain groups of correlated features of similar relevance to the labels, preference is given to small groups over large ones
— larger size of the resulting models. It takes $inline$\large O(NK) $inline$ memory for storing a model where $inline$\large K $inline$ is the number of trees.
#5. Homework
In this home [assignment](https://github.com/Yorko/mlcourse_open/blob/master/jupyter_notebooks/topic5_bagging_rf/hw5_logit_rf_credit_scoring.ipynb), we will solve a credit scoring problem. We will see in practice how and where to apply the bootstrap, what the advantages and disadvantages of logistic regression are in comparison with a random forest, and in which cases bagging works better and in which worse. You can find the answers to many of the questions if you read this article carefully.
Answers should be submitted via the [web form](https://goo.gl/forms/wNLR2QJj0ZqQ7B9q1). You can get a maximum of 12 points for this assignment. Deadline: April 3, 23:59 UTC+3.
#6. Useful resources
– Chapter 15 of the book "[The Elements of Statistical Learning](https://statweb.stanford.edu/~tibs/ElemStatLearn/)" by Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie
– Alexander Dyakonov's [blog](https://alexanderdyakonov.wordpress.com/2016/11/14/случайный-лес-random-forest/)
– more on practical applications of random forest and other composition algorithms in official documentation of [scikit-learn](http://scikit-learn.org/stable/modules/ensemble.html)
– Evgeny Sokolov's machine learning [course](https://github.com/esokolov/ml-course-hse) (materials on GitHub), with additional practical assignments to deepen your knowledge
– An overview [article](https://www.researchgate.net/publication/278019662_Istoria_razvitia_ansamblevyh_metodov_klassifikacii_v_masinnom_obucenii) "История развития ансамблевых методов классификации в машинном обучении" ("History of the development of ensemble classification methods in machine learning") by Yu. Kashnitsky
*The article was written in co-authorship with @yorko (Yury Kashnitsky). The homework assignment was prepared by @vradchenko (Vitaly Radchenko). Thanks to @bauchgefuehl (Anastasia Manokhina) for editing.*