No Free Lunch From Random Feature Ensembles
Ruben, Benjamin S., Tong, William L., Chaudhry, Hamza Tahir, Pehlevan, Cengiz
Given a budget on total model size, one must decide whether to train a single, large neural network or to combine the predictions of many smaller networks. We study this trade-off for ensembles of random-feature ridge regression models. We prove that when a fixed number of trainable parameters is partitioned among K independently trained models, K = 1 achieves optimal performance, provided the ridge parameter is optimally tuned. We then derive scaling laws that describe how the test risk of an ensemble of regression models decays with its total size. We identify conditions on the kernel and task eigenstructure under which ensembles can achieve near-optimal scaling laws. Training ensembles of deep convolutional neural networks on CIFAR-10 and a transformer architecture on C4, we find that a single large network outperforms any ensemble of networks with the same total number of parameters, provided the weight decay and feature-learning strength are tuned to their optimal values.

Ensembling methods are a well-established tool in machine learning for reducing the variance of learned predictors. While traditional ensemble approaches like random forests (Breiman, 2001) and XGBoost (Chen & Guestrin, 2016) combine many weak predictors, the advent of deep neural networks has shifted the state of the art toward training a single large predictor (LeCun et al., 2015). However, deep neural networks still suffer from various sources of variance, such as finite datasets and random initialization (Atanasov et al., 2022; Adlam & Pennington, 2020; Lin & Dobriban, 2021; Atanasov et al., 2024).
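The setup studied here can be sketched in a few lines of numpy: train K independent random-feature ridge regressors that split a fixed feature budget, average their predictions, and compare test risk across K. This is a minimal illustrative sketch, not the paper's experimental code; the function names, ReLU feature map, synthetic task, and all hyperparameters are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear teacher task: y = <w*, x> + noise (illustrative choice).
d, n_train, n_test = 50, 200, 1000
w_star = rng.standard_normal(d) / np.sqrt(d)
X_tr = rng.standard_normal((n_train, d))
X_te = rng.standard_normal((n_test, d))
y_tr = X_tr @ w_star + 0.1 * rng.standard_normal(n_train)
y_te = X_te @ w_star

def rf_ridge_predict(n_features, ridge):
    """Train one random-feature ridge regressor; return its test predictions."""
    F = rng.standard_normal((d, n_features)) / np.sqrt(d)  # fixed random projection
    phi = lambda X: np.maximum(X @ F, 0.0)                 # ReLU random features
    Phi_tr, Phi_te = phi(X_tr), phi(X_te)
    # Ridge solution in feature space: a = (Phi^T Phi + lambda I)^{-1} Phi^T y
    A = Phi_tr.T @ Phi_tr + ridge * np.eye(n_features)
    a = np.linalg.solve(A, Phi_tr.T @ y_tr)
    return Phi_te @ a

def ensemble_risk(K, total_features, ridge):
    """Split a total feature budget among K models; average predictions; return test MSE."""
    preds = np.mean(
        [rf_ridge_predict(total_features // K, ridge) for _ in range(K)], axis=0
    )
    return np.mean((preds - y_te) ** 2)

# Compare ensemble sizes at a fixed total budget of random features.
total = 512
for K in (1, 2, 8):
    print(f"K={K}: test risk = {ensemble_risk(K, total, ridge=1e-1):.4f}")
```

Note that the paper's result is about the *optimally tuned* ridge: at a fixed ridge value an ensemble can beat a single model, so a faithful comparison would sweep `ridge` for each K and compare the minima.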
Dec-6-2024