Variational Learning is Effective for Large Deep Networks

Yuesong Shen, Nico Daheim, Bai Cong, Peter Nickl, Gian Maria Marconi, Clement Bazan, Rio Yokota, Iryna Gurevych, Daniel Cremers, Mohammad Emtiyaz Khan, Thomas Möllenhoff

arXiv.org Machine Learning 

We give extensive empirical evidence against the common belief that variational learning is ineffective for large neural networks. We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON's computational costs are nearly identical to Adam, but its predictive uncertainty is better. We show several new use cases of IVON where we improve fine-tuning and model merging in Large Language Models, accurately predict generalization error, and faithfully estimate sensitivity to data. We find overwhelming evidence in support of the effectiveness of variational learning.

… Laplace (MacKay, 1992), which do not directly optimize the variational objective, even though they have variational interpretations. Ideally, we want to know whether a direct optimization of the objective can match the accuracy of Adam-like methods without any increase in the cost, while also yielding good weight-uncertainty to improve calibration, model averaging, knowledge transfer, etc.

In this paper, we present the Improved Variational Online Newton (IVON) method, which adapts the method of Lin et al. (2020) to large scale and obtains state-of-the-art accuracy and uncertainty at nearly identical cost to Adam. Figure 1 shows some examples where, for training GPT-2 (773M parameters) from scratch, IVON gives a 0.4 reduction in validation perplexity over AdamW and, for ResNet-50 (25.6M parameters) on ImageNet, it gives around 2% more accurate predictions.
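To make the idea concrete, the following is a minimal sketch of variational learning with an online Newton-style update on a toy logistic-regression problem. It only illustrates the general recipe suggested above (sample weights from a mean-field Gaussian posterior, reuse the same gradient for a per-parameter curvature estimate, precondition the mean update); it is not the authors' IVON implementation, and the synthetic data, the hyperparameter names (lr, ess, delta, beta1, beta2), and the clamping safeguard are assumptions made purely for illustration.

```python
# Sketch: Adam-like variational learning on toy logistic regression.
# Not the authors' IVON code; hyperparameters and safeguards are illustrative.
import torch

torch.manual_seed(0)

# Synthetic binary-classification data.
n, d = 512, 10
X = torch.randn(n, d)
w_true = torch.randn(d)
y = (X @ w_true + 0.3 * torch.randn(n) > 0).float()

# Mean-field Gaussian posterior q(w) = N(m, diag(1 / (ess * (h + delta)))).
m = torch.zeros(d)        # posterior mean
h = torch.ones(d)         # per-parameter curvature (precision) estimate
g_bar = torch.zeros(d)    # momentum on the gradient
lr, ess, delta = 0.05, float(n), 1e-2
beta1, beta2 = 0.9, 0.999

def nll(w, xb, yb):
    return torch.nn.functional.binary_cross_entropy_with_logits(xb @ w, yb)

for step in range(2000):
    idx = torch.randint(0, n, (64,))
    xb, yb = X[idx], y[idx]

    sigma = 1.0 / torch.sqrt(ess * (h + delta))
    w = (m + sigma * torch.randn(d)).requires_grad_(True)  # sample from q

    g = torch.autograd.grad(nll(w, xb, yb), w)[0]  # gradient at the sample

    # Curvature from the same gradient: g * (w - m) / sigma^2 estimates the
    # diagonal Hessian (Stein's identity), so no extra backward pass is needed.
    h_hat = g * (w.detach() - m) / sigma**2

    g_bar = beta1 * g_bar + (1 - beta1) * g
    h = beta2 * h + (1 - beta2) * h_hat
    h = torch.clamp(h, min=1e-2)  # crude safeguard against occasional
                                  # negative estimates in this simplified sketch

    # Newton-like, per-parameter preconditioned update of the mean.
    m = m - lr * (g_bar + delta * m) / (h + delta)

# Monte-Carlo model averaging with the learned posterior: predictions come
# with weight uncertainty rather than from a single point estimate.
with torch.no_grad():
    sigma = 1.0 / torch.sqrt(ess * (h + delta))
    probs = torch.stack(
        [torch.sigmoid(X @ (m + sigma * torch.randn(d))) for _ in range(32)]
    ).mean(0)
    print("train accuracy:", ((probs > 0.5).float() == y).float().mean().item())
```

The cost per step is essentially one gradient evaluation plus element-wise updates, which is why an update of this form can stay close to Adam in cost; the learned per-weight variance is what the abstract's downstream use cases (calibration, model averaging, merging, sensitivity to data) build on.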
