A random forest is an ensemble of decision trees. Like other machine-learning techniques, random forests use training data to learn to make predictions.
One drawback of learning with a single tree is overfitting. A single tree tends to fit the training data too closely, resulting in poor prediction performance on unseen data. Such a model is said to have high variance: it is sensitive to small changes in the training data. Although various techniques (pruning, early stopping, and minimum split size) can mitigate tree overfitting, random forests take a different approach.
Random forests use a variation of bagging whereby many trees are trained independently, each on a random resample of the same training data. A forest typically contains several hundred trees.
Comparison to single decision trees
Training a random forest differs from training a single tree in three main ways:
- The training data for each tree is created by sampling from the full data set with replacement.
- Only a subset of variables is considered when deciding how to split each node.
- Random forest trees are trained until the leaf nodes contain one or very few samples.
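The first two points above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the code of any particular random forest package. The helper names `bootstrap_indices` and `candidate_features` are hypothetical, and the square-root rule for the subset size is an assumption (it is a common default for classification, but implementations differ).

```python
import math
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(n_features, rng):
    """Pick the random subset of feature indices considered at one split.

    Assumption: the subset size is floor(sqrt(n_features)), at least 1.
    """
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)                  # seeded so the sketch is repeatable
rows = bootstrap_indices(10, rng)       # rows used to train one tree
features = candidate_features(9, rng)   # features considered at one node
```

Because sampling is done with replacement, `rows` usually contains some indices more than once and omits others. A fresh feature subset would be drawn at every node, not once per tree.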
For classification, the prediction of the forest is the most common prediction among the individual trees. For regression, the forest prediction is the average of the individual tree predictions.
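As a minimal sketch of how the individual tree predictions are combined, the following assumes the per-tree predictions for a single observation are already available as a list. The function names `forest_classify` and `forest_regress` are hypothetical.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_predictions):
    """Return the most common class label among the trees (majority vote)."""
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Return the average of the trees' numeric predictions."""
    return mean(tree_predictions)


label = forest_classify(["cat", "dog", "cat"])  # two of three trees vote "cat"
value = forest_regress([1.0, 2.0, 3.0])         # average of the three trees
```

In this sketch, a tied vote is resolved in favor of the label that appears first in the list; a real implementation may break ties differently.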
Disadvantages of random forests
- Although random forests can be an improvement on single decision trees, more sophisticated techniques are available. Prediction accuracy on complex problems is usually inferior to that of gradient-boosted trees.
- A forest is less interpretable than a single decision tree. Single trees may be visualized as a sequence of decisions.
- A trained forest may require significant memory for storage, because it must retain the information from several hundred individual trees.
Advantages of random forests
- Work well “out of the box” without tuning any parameters. Other models may have settings that require significant experimentation to find the best values.
- Tend not to overfit. Randomizing the data and variables across many trees means that no single tree sees all the data. This helps the forest to focus on the general patterns within the training data and reduces its sensitivity to noise.
- Can handle numeric and categorical predictors and outcomes, as well as non-linear relationships between them. Other models may require numeric inputs or assume linearity.
- Accuracy calculated from out-of-bag samples is a proxy for using a separate test data set. The out-of-bag samples are those not used for training a specific tree and as such can be used as an unbiased measure of performance.
- Predictor variable importance can be calculated. For more information, see “How is Variable Importance Calculated for Random Forests?”
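To illustrate the out-of-bag idea mentioned in the list above, the sketch below draws one bootstrap sample and finds the rows that were never selected. It uses only the Python standard library; the helper name `out_of_bag_indices` and the data set size of 1,000 rows are assumptions chosen for the example.

```python
import random


def out_of_bag_indices(n_rows, bootstrap):
    """Return the row indices that never appear in the bootstrap sample."""
    drawn = set(bootstrap)
    return [i for i in range(n_rows) if i not in drawn]


rng = random.Random(42)  # seeded so the sketch is repeatable
n_rows = 1000            # hypothetical data set size

# One bootstrap sample: n_rows draws with replacement.
bootstrap = [rng.randrange(n_rows) for _ in range(n_rows)]

# Rows this tree never saw; they can be used to test this tree.
oob = out_of_bag_indices(n_rows, bootstrap)
oob_fraction = len(oob) / n_rows
```

For a large data set, the expected out-of-bag fraction is (1 - 1/n)^n, which approaches 1/e, or roughly 37% of the rows. Each row is therefore out-of-bag for about a third of the trees, and the forest's out-of-bag accuracy is computed by predicting each row using only those trees.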