



On Warm-Starting Neural Network Training

Neural Information Processing Systems

In many real-world deployments of machine learning systems, data arrive piecemeal. These learning scenarios may be passive, where data arrive incrementally due to structural properties of the problem (e.g., daily financial data), or active, where samples are selected according to a measure of their quality (e.g., experimental design). In both of these cases, we are building a sequence of models that incorporate an increasing amount of data. We would like each of these models in the sequence to be performant and take advantage of all the data that are available to that point. Conventional intuition suggests that when solving a sequence of related optimization problems of this form, it should be possible to initialize using the solution of the previous iterate (to "warm start" the optimization rather than initialize from scratch) and see reductions in wall-clock time. However, in practice this warm-starting seems to yield poorer generalization performance than models that have fresh random initializations, even though the final training losses are similar.
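
The two initialization protocols contrasted in this abstract can be sketched as follows. This is only a minimal illustration under stated assumptions: the one-hidden-layer NumPy network, the synthetic data, and the names `init_params`, `forward`, `loss`, and `train` are all hypothetical and not taken from the paper. It shows how a warm-started run and a freshly initialized run are set up, and makes no claim to reproduce the reported generalization gap.

```python
import numpy as np

def init_params(rng, n_in, n_hidden):
    """Fresh random initialization for a one-hidden-layer binary classifier."""
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(n_hidden), size=n_hidden),
        "b2": 0.0,
    }

def forward(params, X):
    """Return hidden activations and output logits."""
    H = np.tanh(X @ params["W1"] + params["b1"])
    logits = H @ params["W2"] + params["b2"]
    return H, logits

def loss(params, X, y):
    """Mean binary cross-entropy, computed stably from logits."""
    _, z = forward(params, X)
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

def train(params, X, y, steps=200, lr=0.5):
    """Full-batch gradient descent; the input dict is copied, not modified."""
    p = {k: np.copy(v) for k, v in params.items()}
    n = len(y)
    for _ in range(steps):
        H, z = forward(p, X)
        dz = (1.0 / (1.0 + np.exp(-z)) - y) / n      # dL/dlogits
        dH = np.outer(dz, p["W2"]) * (1.0 - H ** 2)  # backprop through tanh
        p["W2"] -= lr * (H.T @ dz)
        p["b2"] -= lr * dz.sum()
        p["W1"] -= lr * (X.T @ dH)
        p["b1"] -= lr * dH.sum(axis=0)
    return p

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(float)
X_half, y_half = X[:200], y[:200]

# Warm start: train on the first half, then continue on all data
# from the weights that the first stage produced.
stage1 = train(init_params(rng, 5, 16), X_half, y_half)
warm = train(stage1, X, y)

# Fresh start: a new random initialization trained on all data.
cold = train(init_params(rng, 5, 16), X, y)

print("warm-start train loss:", loss(warm, X, y))
print("fresh-init train loss:", loss(cold, X, y))
```

To compare generalization rather than training loss, one would evaluate `warm` and `cold` on held-out data drawn from the same distribution.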


DASH: Warm-Starting Neural Network Training in Stationary Settings without Loss of Plasticity

Neural Information Processing Systems

Warm-starting neural network training by initializing networks with previously learned weights is appealing, as practical neural networks are often deployed under a continuous influx of new data. However, it often leads to loss of plasticity, where the network loses its ability to learn new information, resulting in worse generalization than training from scratch. This occurs even under stationary data distributions, and its underlying mechanism is poorly understood. We develop a framework emulating real-world neural network training and identify noise memorization as the primary cause of plasticity loss when warm-starting on stationary data. Motivated by this, we propose Direction-Aware SHrinking (DASH), a method aiming to mitigate plasticity loss by selectively forgetting memorized noise while preserving learned features.


Review for NeurIPS paper: On Warm-Starting Neural Network Training

Neural Information Processing Systems

Weaknesses: The paper evaluates only on CIFAR/SVHN, and I worry that this phenomenon may not extend to other methods and tasks. Warm-starting, in the context of the authors' problem setup, seems to be essentially the same thing as fine-tuning with more data. This phenomenon does not seem to occur on more sophisticated computer-vision tasks, and fine-tuning from datasets like ImageNet leads to similar or better performance with much faster convergence. Although the label space is different in many fine-tuning setups, one can imagine extending the existing setup to cover common and more realistic problems. The paper is written to motivate the idea of re-using weights for the continual/online learning setting, but splitting the dataset into two sets (training with one and fine-tuning with both) seems to me a somewhat toyish and unconventional continual learning setting. In online/continual learning there is a distribution shift as new data enter, but here the dataset seems to be randomly split, meaning that in expectation the distribution of these two sets should be the same.


Review for NeurIPS paper: On Warm-Starting Neural Network Training

Neural Information Processing Systems

The paper reports an interesting phenomenon: sometimes fine-tuning a pre-trained network does worse than training from scratch, even when pre-training and fine-tuning are performed on the same dataset. The authors propose a method to remedy this problem. The reviewers are on the fence about the paper, but acknowledge that it is an understudied area. Their main concerns are the lack of any theoretical insights and the method being a "trick". I believe that the findings of this paper are going to be of interest to the community.

