Decision trees are a simple and powerful predictive modeling technique, but they suffer from high variance. This means that trees can produce very different results when given different training data.
A technique to make decision trees more robust and to achieve better performance is called bootstrap aggregation or bagging for short.
In this tutorial, you will discover how to implement the bagging procedure with decision trees from scratch with Python.
After completing this tutorial, you will know:
How to create a bootstrap sample of your dataset.
How to make predictions with bootstrapped models.
How to apply bagging to your own predictive modeling problems.

Let’s get started.

Descriptions

This section provides a brief description of Bootstrap Aggregation and the Sonar dataset that will be used in this tutorial.
Bootstrap Aggregation Algorithm

A bootstrap is a sample of a dataset with replacement.
This means that a new dataset is created from a random sample of an existing dataset where a given row may be selected and added more than once to the sample.
It is a useful approach for estimating values such as the mean of a broader population when you only have a limited dataset available. By creating samples of your dataset and estimating the mean from those samples, you can take the average of those estimates and get a better idea of the true mean of the underlying problem.
This same approach can be used with machine learning algorithms that have a high variance, such as decision trees. A separate model is trained on each bootstrap sample of data, and the average output of those models is used to make predictions. This technique is called bootstrap aggregation, or bagging for short.
Variance means that an algorithm’s performance is sensitive to the training data, with high variance suggesting that the more the training data is changed, the more the performance of the algorithm will vary.
The performance of high variance machine learning algorithms like unpruned decision trees can be improved by training many trees and taking the average of their predictions. Results are often better than a single decision tree.
Another benefit of bagging, in addition to improved performance, is that the bagged decision trees do not tend to overfit the problem as more trees are added. Trees can continue to be added until a maximum in performance is achieved.
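For a classification problem such as the one used in this tutorial, averaging the model outputs amounts to taking a majority vote across the bagged models. The sketch below is a rough illustration of this idea, not the exact function developed later in the tutorial; it assumes models is a list of already-trained models and predict(model, row) returns a class prediction for a single row.

# A minimal sketch of combining bagged predictions by majority vote.
# `models` and `predict` are illustrative placeholders: `models` is a list of
# trained models and `predict(model, row)` returns the class predicted by one model.
def bagging_predict(models, row, predict):
	predictions = [predict(model, row) for model in models]
	# Return the most common prediction (the majority vote)
	return max(set(predictions), key=predictions.count)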
Sonar Dataset

The dataset we will use in this tutorial is the Sonar dataset.
This is a dataset that describes sonar chirp returns bouncing off different surfaces. The 60 input variables are the strength of the returns at different angles. It is a binary classification problem that requires a model to differentiate rocks from metal cylinders. There are 208 observations.
It is a well-understood dataset. All of the variables are continuous and generally in the range of 0 to 1. The output variable is a string “M” for mine and “R” for rock, which will need to be converted to integers 1 and 0.
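As a rough sketch of that conversion (assuming the class label is stored in the last column of each row), the mapping could look like the following; the function name is illustrative and not necessarily the helper used later in the tutorial.

# Illustrative sketch: convert the output strings to integers, "M" (mine) -> 1 and "R" (rock) -> 0.
# Assumes the class label is the last value in each row.
def convert_labels(dataset):
	for row in dataset:
		row[-1] = 1 if row[-1] == 'M' else 0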
By predicting the class with the most observations in the dataset (M or mines) the Zero Rule Algorithm can achieve an accuracy of 53%.
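For reference, that baseline simply predicts the most frequent class in the training data; a minimal illustrative sketch (the function name is hypothetical) is shown below.

# Illustrative Zero Rule baseline: always predict the most common class value in the training set.
def zero_rule_predict(train):
	labels = [row[-1] for row in train]
	return max(set(labels), key=labels.count)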
You can learn more about this dataset at the UCI Machine Learning Repository.
Download the dataset for free and place it in your working directory with the filename sonar.all-data.csv.
Tutorial

This tutorial is broken down into 2 parts:
1. Bootstrap Resample.
2. Sonar Dataset Case Study.

These steps provide the foundation that you need to implement and apply bootstrap aggregation with decision trees to your own predictive modeling problems.
1. Bootstrap Resample

Let’s start off by getting a strong idea of how the bootstrap method works.
We can create a new sample of a dataset by randomly selecting rows from the dataset and adding them to a new list. We can repeat this for a fixed number of rows or until the size of the new dataset matches a ratio of the size of the original dataset.
We can allow sampling with replacement by not removing the row that was selected, so that it is available for future selections.
Below is a function named subsample() that implements this procedure. The randrange() function from the random module is used to select a random row index to add to the sample each iteration of the loop. The default size of the sample is the size of the original dataset.
# Create a random subsample from the dataset with replacement
def subsample(dataset, ratio=1.0):
	sample = list()
	n_sample = round(len(dataset) * ratio)
	while len(sample) < n_sample:
		index = randrange(len(dataset))
		sample.append(dataset[index])
	return sample

We can use this function to estimate the mean of a contrived dataset.
First, we can create a dataset with 20 rows and a single column of random numbers between 0 and 9 and calculate the mean value.
We can then make bootstrap samples of the original dataset, calculate the mean, and repeat this process until we have a list of means. Taking the average of these sample means should give us a robust estimate of the mean of the entire dataset.
The complete example is listed below.
Each bootstrap sample is created as a 10% sample of the original 20-observation dataset (or 2 observations). We then experiment by creating 1, 10, and 100 bootstrap samples of the original dataset, calculating their mean values, and then averaging all of those estimated mean values.
from random import seed
from random import random
from random import randrange

# Create a random subsample from the dataset with replacement
def subsample(dataset, ratio=1.0):
	sample = list()
	n_sample = round(len(dataset) * ratio)
	while len(sample) < n_sample:
		index = randrange(len(dataset))
		sample.append(dataset[index])
	return sample

# Calculate the mean of a list of numbers
def mean(numbers):
	return sum(numbers) / float(len(numbers))

seed(1)
# True mean
dataset = [[randrange(10)] for i in range(20)]
print('True Mean: %.3f' % mean([row[0] for row in dataset]))
# Estimated means
ratio = 0.10
for size in [1, 10, 100]:
	sample_means = list()
	for i in range(size):
		sample = subsample(dataset, ratio)
		sample_mean = mean([row[0] for row in sample])
		sample_means.append(sample_mean)
	print('Samples=%d, Estimated Mean: %.3f' % (size, mean(sample_means)))

Running the example prints the original mean value we aim to estimate.
We can then see the estimated mean from the various different numbers of bootstrap samples. We can see that with 100 samples we achieve a good estimate of the mean.
True Mean: 4.450
Samples=1, Estimated Mean: 4.500
Samples=10, Estimated Mean: 3.300
Samples=100, Estimated Mean: 4.480

Instead of calculating the mean value, we can create a model from each subsample.
Next, let’s see how we can combine the predictions from multiple bootstrap models.
2. Sonar Dataset Case Study

In this section, we will apply the bagging algorithm with decision trees to the Sonar dataset.
The example assumes that a CSV copy of the dataset is in the current working directory with the file name sonar.all-data.csv.
The dataset is first loaded, the string values are converted to numeric, and the output column is converted from strings to the integer values 0 and 1.
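As a rough sketch of this data preparation (the function names here are illustrative and may differ from the helpers used in the rest of the tutorial), loading the CSV file and converting the input columns to floating point values might look like the following.

from csv import reader

# Illustrative sketch: load a CSV file into a list of rows
def load_csv(filename):
	dataset = list()
	with open(filename, 'r') as file:
		for row in reader(file):
			if row:
				dataset.append(row)
	return dataset

# Illustrative sketch: convert one column of string values to floats
def str_column_to_float(dataset, column):
	for row in dataset:
		row[column] = float(row[column].strip())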