The Impact of Bootstrap Sampling Rate on Random Forest Performance in Regression Tasks
Michał Iwaniuk, Mateusz Jarosz, Bartłomiej Borycki, Bartosz Jezierski, Jan Cwalina, Stanisław Kaźmierczak, Jacek Mańdziuk
arXiv.org Artificial Intelligence
Abstract--Random Forests (RFs) typically train each tree on a bootstrap sample of the same size as the training set, i.e., with a bootstrap rate (BR) equal to 1.0. We systematically examine how varying BR from 0.2 to 5.0 affects RF performance across 39 heterogeneous regression datasets and 16 RF configurations, evaluating with repeated two-fold cross-validation and mean squared error (MSE). Our results demonstrate that tuning the BR can yield significant improvements over the default: the best setup relied on BR < 1.0 for 24 datasets and BR > 1.0 for 15, while BR = 1.0 was optimal in only 4 cases. We establish a link between dataset characteristics and the preferred BR: datasets with strong global feature-target relationships favor higher BRs, while those with higher local target variance benefit from lower BRs. To further investigate this relationship, we conducted experiments on synthetic datasets with controlled noise levels. These experiments reproduce the observed bias-variance trade-off: in low-noise scenarios, higher BRs effectively reduce model bias, whereas in high-noise settings, lower BRs help reduce model variance. Overall, BR is an influential hyperparameter that should be tuned to optimize RF regression models.

Random Forest (RF) is an ensemble machine learning (ML) algorithm involving a set of decision trees that collectively make a decision. In classification tasks, each tree votes for a particular class, and the predicted label is determined either by hard voting (majority vote) or soft voting (averaged class probabilities across the trees). In regression tasks, the final prediction is the mean of all individual tree outputs. RFs serve as a robust baseline across a wide range of ML problems, offering an effective balance of predictive accuracy, training speed, and moderate interpretability.
While gradient-boosted trees or deep neural networks may outperform them in heavily tuned or domain-specific settings, RF models consistently deliver near-optimal results with minimal tuning, especially on structured, tabular datasets [1], [2].
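To make the bootstrap-rate idea concrete, the sketch below implements a minimal bagged ensemble of regression trees whose bootstrap sample size is BR times the training-set size. Scikit-learn's built-in `max_samples` parameter caps the rate at 1.0, so oversampling (BR > 1.0) is done by drawing indices with replacement manually. The class name `BaggedTrees` and the `bootstrap_rate` parameter are illustrative, not from the paper.

```python
# Minimal sketch of a random-forest-style regressor with a tunable
# bootstrap rate (BR), including BR > 1.0. Illustrative only; the paper's
# own experimental setup (16 RF configurations, repeated 2-fold CV) is
# not reproduced here.
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class BaggedTrees:
    def __init__(self, n_estimators=100, bootstrap_rate=1.0, random_state=0):
        self.n_estimators = n_estimators
        self.bootstrap_rate = bootstrap_rate  # bootstrap sample size / n
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        n = len(X)
        # Bootstrap sample size may exceed n when bootstrap_rate > 1.0.
        m = max(1, int(round(self.bootstrap_rate * n)))
        self.trees_ = []
        for _ in range(self.n_estimators):
            idx = rng.integers(0, n, size=m)  # draw with replacement
            tree = DecisionTreeRegressor(max_features="sqrt")
            tree.fit(X[idx], y[idx])
            self.trees_.append(tree)
        return self

    def predict(self, X):
        # Regression prediction: mean of the individual tree outputs.
        return np.mean([t.predict(X) for t in self.trees_], axis=0)
```

A grid search over `bootstrap_rate` (e.g., 0.2 to 5.0, as in the paper) combined with cross-validated MSE would then reveal whether a given dataset prefers a rate below or above the default of 1.0.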
Nov-19-2025