Tang, Cheng, Garreau, Damien, Luxburg, Ulrike von
Random forests are learning algorithms that build large collections of random trees and make predictions by averaging the individual tree predictions. In this paper, we consider various tree constructions and examine how the choice of parameters affects the generalization error of the resulting random forests as the sample size goes to infinity. We show that subsampling of data points during the tree construction phase is important: Forests can become inconsistent with either no subsampling or too severe subsampling. As a consequence, even highly randomized trees can lead to inconsistent forests if no subsampling is used, which implies that some of the commonly used setups for random forests can be inconsistent. As a second consequence we can show that trees that have good performance in nearest-neighbor search can be a poor choice for random forests.
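To make the role of subsampling concrete, here is a minimal sketch of a forest that averages randomized trees, each grown on a subsample drawn without replacement. It is an illustration under stated assumptions, not one of the constructions analysed in the paper: the 1-D uniform-split tree and the names `build_tree`, `predict_tree`, `forest_predict`, `subsample_size` and `depth` are all hypothetical.

```python
import random

def build_tree(points, depth, rng):
    """Grow a randomized tree on 1-D (x, y) pairs; leaves store the mean response."""
    mean = sum(y for _, y in points) / len(points)
    lo = min(x for x, _ in points)
    hi = max(x for x, _ in points)
    if depth == 0 or lo == hi:
        return mean
    t = rng.uniform(lo, hi)  # split position is chosen without looking at the responses
    left = [p for p in points if p[0] <= t]
    right = [p for p in points if p[0] > t]
    if not right:  # can only happen if t lands exactly on hi
        return mean
    return (t, build_tree(left, depth - 1, rng), build_tree(right, depth - 1, rng))

def predict_tree(tree, x):
    """Descend from the root to a leaf and return the stored mean."""
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree

def forest_predict(points, x, n_trees, subsample_size, depth, seed=0):
    """Average the predictions of n_trees trees, each built on its own subsample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # subsample_size == len(points) corresponds to "no subsampling"
        sample = rng.sample(points, subsample_size)
        total += predict_tree(build_tree(sample, depth, rng), x)
    return total / n_trees
```

Setting `subsample_size` to `len(points)` gives the no-subsampling regime, while very small values give the severe-subsampling regime; the abstract states that both extremes can lead to inconsistent forests.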
The random forest algorithm, proposed by L. Breiman in 2001, has been extremely successful as a general-purpose classification and regression method. The approach, which combines several randomized decision trees and aggregates their predictions by averaging, has shown excellent performance in settings where the number of variables is much larger than the number of observations. Moreover, it is versatile enough to be applied to large-scale problems, is easily adapted to various ad-hoc learning tasks, and returns measures of variable importance. The present article reviews the most recent theoretical and methodological developments for random forests. Emphasis is placed on the mathematical forces driving the algorithm, with special attention given to the selection of parameters, the resampling mechanism, and variable importance measures. This review is intended to provide non-experts easy access to the main ideas.
This paper is a comment on the survey paper by Biau and Scornet (2016) about random forests. We focus on the problem of quantifying the impact of each ingredient of random forests on their performance. We show that such a quantification is possible for a simple pure forest, leading to conclusions that could apply more generally. Then, we consider "hold-out" random forests, which are a good middle point between "toy" pure forests and Breiman's original random forests.
Li, Yiming, Bai, Jiawang, Tang, Qingtao, Jiang, Yong, Li, Chun, Xia, Shutao
Random forests (RF) are among the most widely used ensemble learning methods for classification and regression tasks. Despite their impressive performance, the analysis of their theoretical consistency, which would ensure that their results converge to the optimum as the sample size increases, has lagged far behind. Several consistent random forest variants have been proposed, yet all perform relatively poorly compared to the original random forests. In this paper, a novel RF framework named multinomial random forests (MRF) is proposed. In the MRF, an impurity-based multinomial distribution is constructed as the basis for the selection of a splitting point. This ensures that a certain degree of randomness is achieved while the overall quality of the trees is not much different from that of the original random forests. We prove the consistency of the MRF and demonstrate on multiple datasets that it performs similarly to the original random forests and better than existing consistent random forest variants for both classification and regression tasks.
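The idea of drawing a splitting point from an impurity-based multinomial distribution can be sketched as follows. The abstract does not specify how the distribution is built, so this example makes its own assumptions: it uses the Gini impurity decrease of each candidate threshold and turns the decreases into probabilities with a softmax. The `temperature` parameter and all function names are hypothetical and should not be read as the MRF authors' exact procedure.

```python
import math
import random

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def impurity_decrease(xs, ys, t):
    """Reduction in Gini impurity obtained by splitting at threshold t."""
    left = [y for x, y in zip(xs, ys) if x <= t]
    right = [y for x, y in zip(xs, ys) if x > t]
    n = len(ys)
    return gini(ys) - len(left) / n * gini(left) - len(right) / n * gini(right)

def split_probabilities(xs, ys, candidates, temperature=0.1):
    """Softmax over impurity decreases: better splits get higher probability."""
    gains = [impurity_decrease(xs, ys, t) for t in candidates]
    best = max(gains)  # subtract the maximum for numerical stability
    weights = [math.exp((g - best) / temperature) for g in gains]
    total = sum(weights)
    return [w / total for w in weights]

def sample_split(xs, ys, candidates, rng, temperature=0.1):
    """Draw one threshold from the multinomial distribution over candidates."""
    probs = split_probabilities(xs, ys, candidates, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]
```

In this sketch, a small `temperature` concentrates the distribution on the best split, approaching a greedy tree, while a large value approaches a uniformly random choice; this mirrors the trade-off between randomness and tree quality described in the abstract.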