Möllenhoff, Thomas
Uncertainty-Aware Decoding with Minimum Bayes Risk
Daheim, Nico, Meister, Clara, Möllenhoff, Thomas, Gurevych, Iryna
Despite their outstanding performance in the majority of scenarios, contemporary language models still occasionally generate undesirable outputs, for example, hallucinated text. While such behaviors have previously been linked to uncertainty, there is a notable lack of methods that actively consider uncertainty during text generation. In this work, we show how Minimum Bayes Risk (MBR) decoding, which selects model generations according to an expected risk, can be generalized into a principled uncertainty-aware decoding method. In short, we account for model uncertainty during decoding by incorporating a posterior over model parameters into MBR's computation of expected risk. We show that this modified expected risk is useful for both choosing outputs and deciding when to abstain from generation and can provide improvements without incurring overhead. We benchmark different methods for learning posteriors and show that performance improves with prediction diversity. We release our code publicly.
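The idea can be sketched in a few lines. This is a hypothetical minimal illustration, not the paper's implementation: a toy token-overlap F1 stands in for the utility metric, and a list of hypothesis sets, one per posterior sample of the model, stands in for the learned posterior. All names are illustrative.

```python
def pairwise_utility(a: str, b: str) -> float:
    """Toy utility: token-set F1 overlap (stand-in for BLEU, BERTScore, etc.)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    overlap = len(ta & tb)
    p, r = overlap / len(ta), overlap / len(tb)
    return 2 * p * r / (p + r) if p + r else 0.0

def uncertainty_aware_mbr(samples_per_model, abstain_below=None):
    """samples_per_model: one list of generated hypotheses per posterior
    sample of the model. Expected utility is averaged over hypotheses AND
    over posterior samples, which is the uncertainty-aware twist on MBR."""
    candidates = [h for hyps in samples_per_model for h in hyps]

    def expected_utility(c):
        per_model = [sum(pairwise_utility(c, h) for h in hyps) / len(hyps)
                     for hyps in samples_per_model]
        return sum(per_model) / len(per_model)  # average over the posterior

    best = max(candidates, key=expected_utility)
    score = expected_utility(best)
    if abstain_below is not None and score < abstain_below:
        return None, score  # even the best candidate is too risky: abstain
    return best, score
```

A low expected utility under the posterior signals that no candidate is agreed upon across plausible models, which is what motivates using the same quantity for abstention.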
Natural Variational Annealing for Multimodal Optimization
Minh, Tâm Le, Arbel, Julyan, Möllenhoff, Thomas, Khan, Mohammad Emtiyaz, Forbes, Florence
We introduce a new multimodal optimization approach called Natural Variational Annealing (NVA) that combines the strengths of three foundational concepts to simultaneously search for multiple global and local modes of black-box nonconvex objectives. First, it implements a simultaneous search by using variational posteriors, such as mixtures of Gaussians. Second, it applies annealing to gradually trade off exploration for exploitation. Finally, it learns the variational search distribution using natural-gradient learning, where the updates resemble well-known and easy-to-implement algorithms. The three concepts come together in NVA, giving rise to new algorithms and also allowing us to incorporate "fitness shaping", a core concept from evolutionary algorithms. We assess the quality of search on simulations and compare NVA to methods based on gradient descent and evolution strategies. We also provide an application to a real-world inverse problem in planetary science.
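A heavily simplified sketch of the flavor of such a method, under strong assumptions: a single Gaussian search distribution instead of a mixture, no fitness shaping, a geometric temperature schedule, and a score-function Monte Carlo estimate of the natural-gradient step for the mean. The function name and hyperparameters are illustrative, not the paper's.

```python
import random

def nva_sketch(f, dim=1, steps=300, pop=50, lr=0.05, t0=5.0, t_final=0.05, seed=0):
    """Minimize black-box f by natural-gradient search with annealing.
    Natural-gradient step for the Gaussian mean (score-function estimate):
        mu <- mu - lr * E[(f(x) - baseline) / T * (x - mu)]
    with temperature T annealed from t0 down to t_final."""
    rng = random.Random(seed)
    mu = [rng.uniform(-3, 3) for _ in range(dim)]
    sigma = 1.0
    for step in range(steps):
        frac = step / max(steps - 1, 1)
        temp = t0 * (t_final / t0) ** frac            # geometric annealing schedule
        xs = [[m + sigma * rng.gauss(0, 1) for m in mu] for _ in range(pop)]
        fs = [f(x) for x in xs]
        base = sum(fs) / pop                          # baseline reduces variance
        for j in range(dim):
            g = sum((fv - base) / temp * (x[j] - mu[j]) for fv, x in zip(fs, xs)) / pop
            mu[j] -= lr * g
        sigma = max(0.05, sigma * 0.99)               # shrink search width: exploitation
    return mu
```

High temperature early on flattens the objective (exploration); as it anneals, the update increasingly exploits the local landscape, which is the trade-off the abstract describes.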
How to Weight Multitask Finetuning? Fast Previews via Bayesian Model-Merging
Maldonado, Hugo Monzón, Möllenhoff, Thomas, Daheim, Nico, Gurevych, Iryna, Khan, Mohammad Emtiyaz
When finetuning on multiple tasks together, it is important to weigh them carefully to get good performance, but searching for good weights can be difficult and costly. Here, we propose to aid the search with fast previews that quickly give a rough idea of different reweighting options. We use model merging to create previews by simply reusing and averaging the parameters of models trained on each task separately (no retraining required). To improve the quality of the previews, we propose a Bayesian approach that designs new merging strategies by using more flexible posteriors. We validate our findings on vision and natural-language transformers. Our work shows the benefits of model merging via Bayes for improving multitask finetuning.
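The simplest member of this family of previews can be sketched as plain reweighted parameter averaging; this is an illustrative reduction, assuming per-task parameter dictionaries, and omits the more flexible Bayesian posteriors the paper proposes on top.

```python
def merge_preview(task_params, weights):
    """Fast preview of a multitask-weighting choice: instead of retraining
    with task weights w, average the separately trained per-task parameters,
        theta(w) = sum_t w_t * theta_t / sum_t w_t.
    Trying a new weighting only re-runs this average, never training."""
    total = sum(weights)
    keys = task_params[0].keys()
    return {k: sum(w * p[k] for w, p in zip(weights, task_params)) / total
            for k in keys}
```

Because the per-task models are trained once and reused, sweeping over many candidate weightings costs only cheap averaging plus evaluation, which is what makes the previews fast.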
Variational Low-Rank Adaptation Using IVON
Cong, Bai, Daheim, Nico, Shen, Yuesong, Cremers, Daniel, Yokota, Rio, Khan, Mohammad Emtiyaz, Möllenhoff, Thomas
We show that variational learning can significantly improve the accuracy and calibration of Low-Rank Adaptation (LoRA) without a substantial increase in cost. We replace AdamW with the Improved Variational Online Newton (IVON) algorithm to finetune large language models. For Llama-2 with 7 billion parameters, IVON improves the accuracy over AdamW by 2.8% and the expected calibration error by 4.6%. The accuracy is also better than that of other Bayesian alternatives, yet the cost is lower and the implementation is easier. Our work provides additional evidence for the effectiveness of IVON for large language models.
Conformal Prediction via Regression-as-Classification
Guha, Etash, Natarajan, Shlok, Möllenhoff, Thomas, Khan, Mohammad Emtiyaz, Ndiaye, Eugene
Conformal prediction (CP) for regression can be challenging, especially when the output distribution is heteroscedastic, multimodal, or skewed. Some of the issues can be addressed by estimating a distribution over the output, but in practice such approaches can be sensitive to estimation error and yield unstable intervals. Here, we circumvent these challenges by converting regression to a classification problem and then using CP for classification to obtain CP sets for regression. To preserve the ordering of the continuous-output space, we design a new loss function and make the necessary modifications to the CP classification techniques. Empirical results on many benchmarks show that this simple approach gives surprisingly good results on many practical problems.
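The recipe can be sketched with standard split conformal prediction; this is a minimal, hypothetical illustration that assumes a pre-trained classifier over output bins and uses the usual 1-minus-probability nonconformity score. The ordering-preserving loss function, which is the paper's main contribution, is omitted here.

```python
import math

def conformal_regression_as_classification(probs_cal, y_cal, bins, alpha=0.1):
    """Split-conformal CP after discretizing the regression target.
    probs_cal[i][b]: classifier probability of bin b for calibration point i.
    bins: sorted bin centers; y_cal: true continuous targets.
    Returns a function mapping test-point probabilities to a CP set of
    output values (bin centers)."""
    def bin_index(y):
        return min(range(len(bins)), key=lambda b: abs(bins[b] - y))  # nearest bin

    # Nonconformity score: 1 - probability assigned to the true bin.
    scores = sorted(1.0 - p[bin_index(y)] for p, y in zip(probs_cal, y_cal))
    n = len(scores)
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)  # conformal quantile index
    qhat = scores[k]

    def predict_set(probs_test):
        return [bins[b] for b, p in enumerate(probs_test) if 1.0 - p <= qhat]
    return predict_set
```

Because the set is built per bin rather than as a single symmetric interval, it can naturally come out disconnected for multimodal outputs, which interval-based CP cannot do.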
Variational Learning is Effective for Large Deep Networks
Shen, Yuesong, Daheim, Nico, Cong, Bai, Nickl, Peter, Marconi, Gian Maria, Bazan, Clement, Yokota, Rio, Gurevych, Iryna, Cremers, Daniel, Khan, Mohammad Emtiyaz, Möllenhoff, Thomas
We give extensive empirical evidence against the common belief that variational learning is ineffective for large neural networks. We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON's computational costs are nearly identical to Adam but its predictive uncertainty is better. We show several new use cases of IVON where we improve fine-tuning and model merging in Large Language Models, accurately predict generalization error, and faithfully estimate sensitivity to data. We find overwhelming evidence in support of effectiveness of variational learning.
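The core loop can be sketched as a variational online Newton step; this is a simplified, illustrative version, not IVON itself. The posterior is a diagonal Gaussian, and for clarity the sketch plugs in an exact diagonal Hessian supplied by the caller, whereas IVON's key property is estimating it cheaply from gradients alone at Adam-like cost. All names and hyperparameters are hypothetical.

```python
import random

def von_sketch(grad, hess_diag, dim, steps=500, lr=0.1, rho=0.1,
               delta=1e-4, lam=100.0, seed=0):
    """Diagonal-Gaussian posterior N(m, 1/(lam*(h+delta))). Gradients are
    evaluated at a SAMPLED theta, which is what injects weight uncertainty
    into training; h tracks the Hessian (posterior precision) online."""
    rng = random.Random(seed)
    m = [0.0] * dim   # posterior mean (plays the role of the weights)
    h = [1.0] * dim   # running diagonal Hessian estimate
    for _ in range(steps):
        sigma = [1.0 / (lam * (hj + delta)) ** 0.5 for hj in h]
        theta = [mj + sj * rng.gauss(0, 1) for mj, sj in zip(m, sigma)]
        g, hd = grad(theta), hess_diag(theta)
        for j in range(dim):
            h[j] = (1 - rho) * h[j] + rho * hd[j]                 # online precision update
            m[j] -= lr * (g[j] + delta * m[j]) / (h[j] + delta)   # Newton-like step
    return m, h
```

The returned (m, h) define the whole Gaussian posterior, so calibration, model averaging, and sensitivity estimates come for free after training, which is what the new use cases above exploit.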
SAM as an Optimal Relaxation of Bayes
Möllenhoff, Thomas, Khan, Mohammad Emtiyaz
Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. Here, we establish SAM as a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. The connection enables a new Adam-like extension of SAM to automatically obtain reasonable uncertainty estimates, while sometimes also improving its accuracy. By connecting adversarial and Bayesian methods, our work opens a new path to robustness.
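For reference, one SAM step is just a gradient step taken at an adversarially perturbed point; the sketch below is the standard two-step procedure (function names are illustrative). The paper's result is that this perturbed-gradient step can be read as an optimal convex relaxation of the Bayesian expected loss.

```python
def sam_step(params, grad, lr=0.01, rho=0.05):
    """One sharpness-aware minimization step.
    1) Ascend to the worst-case nearby point: eps = rho * g / ||g||.
    2) Descend using the gradient evaluated at that perturbed point."""
    g = grad(params)
    norm = sum(gj * gj for gj in g) ** 0.5 or 1.0            # avoid division by zero
    adv = [p + rho * gj / norm for p, gj in zip(params, g)]  # adversarial ascent
    g_adv = grad(adv)                                        # gradient at worst case
    return [p - lr * gj for p, gj in zip(params, g_adv)]
```

Replacing the single adversarial point by a Gaussian expectation recovers the Bayes objective; the Fenchel biconjugate makes the adversarial version its tightest convex lower bound, which is what enables the Adam-like uncertainty-aware extension.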
The Memory Perturbation Equation: Understanding Model's Sensitivity to Data
Nickl, Peter, Xu, Lu, Tailor, Dharmesh, Möllenhoff, Thomas, Khan, Mohammad Emtiyaz
Understanding a model's sensitivity to its training data is crucial but can also be challenging and costly, especially during training. To simplify such issues, we present the Memory-Perturbation Equation (MPE), which relates a model's sensitivity to perturbations in its training data. Derived using Bayesian principles, the MPE unifies existing sensitivity measures, generalizes them to a wide variety of models and algorithms, and unravels useful properties regarding sensitivities. Our empirical results show that sensitivity estimates obtained during training can be used to faithfully predict generalization on unseen test data. The proposed equation is expected to be useful for future research on robust and adaptive learning.
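A sensitivity estimate in this spirit can be sketched as follows, assuming per-example gradients and a diagonal precision (Hessian) estimate at the trained weights; this is an illustrative simplification, not the paper's general equation.

```python
def memory_perturbation(grads, hess_diag_total, delta=1e-6):
    """Leave-one-out sensitivity sketch: deleting example i shifts the
    solution by roughly
        m_{-i} - m ~= H^{-1} * grad_i(m),
    where H is the (here diagonal) precision at the trained weights and
    grad_i the per-example loss gradient. Returns the norm of the predicted
    shift per example; large values flag examples the model depends on."""
    shifts = []
    for g in grads:
        dev = [gj / (hj + delta) for gj, hj in zip(g, hess_diag_total)]
        shifts.append(sum(d * d for d in dev) ** 0.5)
    return shifts
```

Since the gradients and precision are byproducts of training, such estimates are available during training at little extra cost, which is what makes on-the-fly generalization prediction feasible.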
Model Merging by Uncertainty-Based Gradient Matching
Daheim, Nico, Möllenhoff, Thomas, Ponti, Edoardo Maria, Gurevych, Iryna, Khan, Mohammad Emtiyaz
Models trained on different datasets can be merged by a weighted-averaging of their parameters, but why does it work and when can it fail? Here, we connect the inaccuracy of weighted-averaging to mismatches in the gradients and propose a new uncertainty-based scheme to improve the performance by reducing the mismatch. The connection also reveals implicit assumptions in other schemes such as averaging, task arithmetic, and Fisher-weighted averaging. Our new method gives consistent improvements for large language models and vision transformers, both in terms of performance and robustness to hyperparameters.
The Lie-Group Bayesian Learning Rule
Kıral, Eren Mehmet, Möllenhoff, Thomas, Khan, Mohammad Emtiyaz
The Bayesian Learning Rule provides a framework for generic algorithm design but can be difficult to use for three reasons. First, it requires a specific parameterization of the exponential family. Second, it uses gradients which can be difficult to compute. Third, its update may not always stay on the manifold. We address these difficulties by proposing an extension based on Lie groups where posteriors are parametrized through transformations of an arbitrary base distribution and updated via the group's exponential map. This simplifies all three difficulties for many cases, providing flexible parametrizations through the group's action, simple gradient computation through reparameterization, and updates that always stay on the manifold. We use the new learning rule to derive a new deep-learning algorithm with desirable biologically plausible attributes that learns sparse features. Our work opens a new frontier for the design of algorithms that exploit Lie-group structures.
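The manifold-preserving property can be illustrated with the simplest possible example, assuming a positive scale parameter living on the multiplicative group of positive reals; this toy update is not the paper's algorithm, only an illustration of moving along a group's exponential map.

```python
import math

def lie_group_step(scale, grad_scale, lr=0.1):
    """Toy Lie-group update for a positive scale parameter on (R_+, *).
    An additive step scale - lr*grad could go negative and leave the
    manifold; moving along the group's exponential map,
        s <- s * exp(-lr * s * grad),
    where s*grad is the gradient pulled back to the Lie algebra, keeps s
    positive for ANY step size."""
    return scale * math.exp(-lr * scale * grad_scale)
```

The same pattern, composing the current point with an exponentiated Lie-algebra element, is what lets updates on richer transformation groups stay on the manifold without projections.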