Immer, Alexander
Forecasting Whole-Brain Neuronal Activity from Volumetric Video
Immer, Alexander, Lueckmann, Jan-Matthis, Chen, Alex Bo-Yuan, Li, Peter H., Petkova, Mariela D., Iyer, Nirmala A., Dev, Aparna, Ihrke, Gudrun, Park, Woohyun, Petruncio, Alyson, Weigel, Aubrey, Korff, Wyatt, Engert, Florian, Lichtman, Jeff W., Ahrens, Misha B., Jain, Viren, Januszewski, Michał
Large-scale neuronal activity recordings with fluorescent calcium indicators are increasingly common, yielding high-resolution 2D or 3D videos. Traditional analysis pipelines reduce this data to 1D traces by segmenting regions of interest, leading to inevitable information loss. Inspired by the success of deep learning on minimally processed data in other domains, we investigate the potential of forecasting neuronal activity directly from volumetric videos. To capture long-range dependencies in high-resolution volumetric whole-brain recordings, we design a model with large receptive fields, which allow it to integrate information from distant regions within the brain. We explore the effects of pre-training and perform extensive model selection, analyzing spatio-temporal trade-offs for generating accurate forecasts. Our model outperforms trace-based forecasting approaches on ZAPBench, a recently proposed benchmark on whole-brain activity prediction in zebrafish, demonstrating the advantages of preserving the spatial structure of neuronal activity.
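To make the setting concrete, the following sketch (illustrative only; the layer widths, kernel sizes, and dilation rates are hypothetical and not the architecture from the paper) shows how a 3D convolutional network with dilated kernels builds a large spatial receptive field while mapping a window of context volumes to a forecast volume.

import torch
import torch.nn as nn

class VolumetricForecaster(nn.Module):
    """Illustrative 3D CNN that maps T context volumes to one future volume.

    The T input time steps are treated as channels; dilated 3D convolutions
    grow the spatial receptive field so that distant brain regions can
    influence the forecast. All hyperparameters are placeholders.
    """
    def __init__(self, context_frames=4, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(context_frames, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=4, dilation=4),
            nn.ReLU(),
            nn.Conv3d(hidden, 1, kernel_size=1),  # predicted next volume
        )

    def forward(self, video):  # video: (batch, context_frames, Z, Y, X)
        return self.net(video)

model = VolumetricForecaster()
context = torch.randn(1, 4, 16, 64, 64)   # toy volume, far below whole-brain scale
next_volume = model(context)              # shape (1, 1, 16, 64, 64)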
Influence Functions for Scalable Data Attribution in Diffusion Models
Mlodozeniec, Bruno, Eschenhagen, Runa, Bae, Juhan, Immer, Alexander, Krueger, David, Turner, Richard
Diffusion models have led to significant advancements in generative modelling. Yet their widespread adoption poses challenges regarding data attribution and interpretability. In this paper, we aim to help address such challenges in diffusion models by developing an influence functions framework. Influence function-based data attribution methods approximate how a model's output would have changed if some training data were removed. In supervised learning, this is usually used to predict how the loss on a particular example would change. For diffusion models, we focus on predicting the change in the probability of generating a particular example via several proxy measurements. We show how to formulate influence functions for such quantities, and how previously proposed methods can be interpreted as particular design choices in our framework. To ensure scalability of the Hessian computations in influence functions, we systematically develop K-FAC approximations based on generalised Gauss-Newton matrices specifically tailored to diffusion models. Our recommended method outperforms previous data attribution approaches on common evaluations, such as the Linear Data-modelling Score (LDS) or retraining without top influences, without the need for method-specific hyperparameter tuning.
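As a reference point, the classical influence-function approximation (shown here in generic form, up to sign and scaling conventions; the paper's measurement functions and curvature approximations are its own design choices) estimates the effect on a measurement $m$ of removing a training example $z_j$ from the training loss $\mathcal{L}(\theta)=\sum_n \ell(z_n,\theta)$ with minimiser $\theta_\star$:

\[
\mathcal{I}_m(z_j) \;\propto\; \nabla_\theta m(\theta_\star)^\top H^{-1}\, \nabla_\theta \ell(z_j, \theta_\star),
\qquad H = \nabla^2_\theta \mathcal{L}(\theta_\star),
\]

where, to make the inverse tractable, $H$ is in practice replaced by a generalised Gauss-Newton matrix with a Kronecker-factored (K-FAC) approximation.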
Uncertainty-Penalized Direct Preference Optimization
Houliston, Sam, Pace, Alizée, Immer, Alexander, Rätsch, Gunnar
Aligning Large Language Models (LLMs) to human preferences in content, style, and presentation is challenging, in part because preferences are varied, context-dependent, and sometimes inherently ambiguous. While successful, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are prone to proxy reward over-optimization. Analysis of the DPO loss reveals a critical need for regularization of mislabeled or ambiguous preference pairs to avoid reward hacking. In this work, we develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes, inspired by offline reinforcement learning. The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples. Evaluation of the methods is performed with GPT-2 Medium on the Anthropic-HH dataset using a model ensemble to obtain uncertainty estimates, and shows improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
Aligning LLMs to human preferences in content, style, and presentation has become a central challenge in improving and deploying LLMs, leading to the advent of Reinforcement Learning from Human Feedback (RLHF), now a prominent technique for fine-tuning state-of-the-art LLMs (Casper et al., 2023). The standard RLHF pipeline involves human feedback collection, reward model training, and LLM policy optimization via reinforcement learning (RL). Despite its success, each stage presents challenges, from feedback interpretation and policy generalization to challenging RL implementation (Casper et al., 2023). Direct Preference Optimization (DPO) (Rafailov et al., 2023) effectively bypasses the reward model by fine-tuning the policy to maximize the likelihood of the preference data under the Bradley-Terry model. DPO is easier to implement than RL algorithms and benefits from computational efficiency and stability by avoiding the potential inaccuracies and biases of a reward model (Xu et al., 2024; Casper et al., 2023).
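For context, the standard DPO objective for a preference triple $(x, y_w, y_l)$ with reference policy $\pi_{\text{ref}}$, logistic function $\sigma$, and temperature $\beta$ is

\[
\mathcal{L}_{\text{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)\Big].
\]

One illustrative form of the uncertainty penalization described above (the paper studies several schemes; the exact form here is an assumption) subtracts an ensemble-based uncertainty estimate $\hat{u}(x, y_w, y_l)$, scaled by a coefficient $\lambda$, from the implicit reward margin inside the sigmoid, which attenuates the gradient on uncertain preference pairs.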
Shaving Weights with Occam's Razor: Bayesian Sparsification for Neural Networks Using the Marginal Likelihood
Dhahri, Rayen, Immer, Alexander, Charpentier, Bertrand, Günnemann, Stephan, Fortuin, Vincent
Neural network sparsification is a promising avenue to save computational time and memory costs, especially in an age where many successful AI models are becoming too large to naïvely deploy on consumer hardware. While much work has focused on different weight pruning criteria, the overall sparsifiability of the network, i.e., its capacity to be pruned without quality loss, has often been overlooked. We present Sparsifiability via the Marginal likelihood (SpaM), a pruning framework that highlights the effectiveness of using the Bayesian marginal likelihood in conjunction with sparsity-inducing priors for making neural networks more sparsifiable. Our approach implements an automatic Occam's razor that selects the most sparsifiable model that still explains the data well, both for structured and unstructured sparsification. In addition, we show that the pre-computed posterior Hessian approximation used in the Laplace approximation can be re-used to define a cheap pruning criterion, which outperforms many existing (more expensive) approaches. We demonstrate the effectiveness of our framework, especially at high sparsity levels, across a range of different neural network architectures and datasets.
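A minimal sketch of a Laplace-based pruning score of this kind (the exact criterion and scaling used in SpaM are not reproduced here; the diagonal-precision saliency below is a generic illustration) could look as follows:

import numpy as np

def laplace_prune_mask(weights, posterior_precision_diag, sparsity):
    """Illustrative magnitude-times-curvature pruning criterion.

    weights: flat array of trained parameters theta_i
    posterior_precision_diag: diagonal of a (Laplace) posterior precision,
        e.g. a diagonal GGN/Fisher approximation plus the prior precision
    sparsity: fraction of weights to remove (0 <= sparsity < 1)
    """
    saliency = weights ** 2 * posterior_precision_diag   # OBD-style score
    k = int(sparsity * weights.size)
    threshold = np.partition(saliency, k)[k]             # k-th smallest score
    return saliency >= threshold                         # boolean keep-mask

# toy usage with random numbers standing in for a trained model
theta = np.random.randn(1000)
precision = np.abs(np.random.randn(1000)) + 1.0
mask = laplace_prune_mask(theta, precision, sparsity=0.9)
theta_pruned = theta * mask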
Position Paper: Bayesian Deep Learning in the Age of Large-Scale AI
Papamarkou, Theodore, Skoularidou, Maria, Palla, Konstantina, Aitchison, Laurence, Arbel, Julyan, Dunson, David, Filippone, Maurizio, Fortuin, Vincent, Hennig, Philipp, Hernandez-Lobato, Jose Miguel, Hubin, Aliaksandr, Immer, Alexander, Karaletsos, Theofanis, Khan, Mohammad Emtiyaz, Kristiadi, Agustinus, Li, Yingzhen, Mandt, Stephan, Nemeth, Christopher, Osborne, Michael A., Rudner, Tim G. J., Rügamer, David, Teh, Yee Whye, Welling, Max, Wilson, Andrew Gordon, Zhang, Ruqi
In the current landscape of deep learning research, there is a predominant emphasis on achieving high predictive accuracy in supervised tasks involving large image and language datasets. However, a broader perspective reveals a multitude of overlooked metrics, tasks, and data types, such as uncertainty, active and continual learning, and scientific data, that demand attention. Bayesian deep learning (BDL) constitutes a promising avenue, offering advantages across these diverse settings. This paper posits that BDL can elevate the capabilities of deep learning. It revisits the strengths of BDL, acknowledges existing challenges, and highlights some exciting research avenues aimed at addressing these obstacles. Looking ahead, the discussion focuses on possible ways to combine large-scale foundation models with BDL to unlock their full potential.
Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures
Eschenhagen, Runa, Immer, Alexander, Turner, Richard E., Schneider, Frank, Hennig, Philipp
The core components of many modern neural network architectures, such as transformers, convolutional, or graph neural networks, can be expressed as linear layers with $\textit{weight-sharing}$. Kronecker-Factored Approximate Curvature (K-FAC), a second-order optimisation method, has shown promise to speed up neural network training and thereby reduce computational costs. However, there is currently no framework to apply it to generic architectures, specifically ones with linear weight-sharing layers. In this work, we identify two different settings of linear weight-sharing layers which motivate two flavours of K-FAC -- $\textit{expand}$ and $\textit{reduce}$. We show that they are exact for deep linear networks with weight-sharing in their respective setting. Notably, K-FAC-reduce is generally faster than K-FAC-expand, which we leverage to speed up automatic hyperparameter selection via optimising the marginal likelihood for a Wide ResNet. Finally, we observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer. However, both variations are able to reach a fixed validation metric target in $50$-$75\%$ of the number of steps of a first-order reference run, which translates into a comparable improvement in wall-clock time. This highlights the potential of applying K-FAC to modern neural network architectures.
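The distinction can be illustrated by how the input-activation Kronecker factor is computed for a linear layer whose weights are shared over a dimension of size $S$ (e.g. tokens or spatial positions). The sketch below is a simplified illustration with assumed shapes and scaling, not the paper's exact estimator: expand folds the shared dimension into the batch, while reduce aggregates over it first.

import numpy as np

def input_kronecker_factor(acts, flavour="expand"):
    """Sketch of the activation Kronecker factor for a weight-sharing linear
    layer, with acts of shape (batch N, shared positions S, features D).

    expand: treat every shared position as an independent example.
    reduce: aggregate over the shared dimension first (here: mean), then
            compute the factor per example.
    """
    n, s, d = acts.shape
    if flavour == "expand":
        a = acts.reshape(n * s, d)
        return a.T @ a / (n * s)
    elif flavour == "reduce":
        a = acts.mean(axis=1)                 # (N, D)
        return a.T @ a / n
    raise ValueError(flavour)

acts = np.random.randn(8, 16, 32)             # toy batch of token activations
A_expand = input_kronecker_factor(acts, "expand")
A_reduce = input_kronecker_factor(acts, "reduce")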
Uncertainty in Graph Contrastive Learning with Bayesian Neural Networks
Möllers, Alexander, Immer, Alexander, Isufi, Elvin, Fortuin, Vincent
Graph contrastive learning has shown great promise when labeled data is scarce but large unlabeled datasets are available. However, it often does not take uncertainty estimation into account. We show that a variational Bayesian neural network approach can be used to improve not only the uncertainty estimates but also the downstream performance on semi-supervised node-classification tasks. Moreover, we propose a new measure of uncertainty for contrastive learning that is based on the disagreement in likelihood due to different positive samples.
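One way to read such a disagreement measure (an assumption about its form, shown only for illustration) is as the spread of the model's likelihood of an anchor node across its different positive samples:

import numpy as np

def positive_sample_disagreement(log_likelihoods):
    """log_likelihoods: array of shape (num_positive_samples,) holding the
    model's approximate log-likelihood of an anchor under each positive
    view/sample. A larger spread indicates higher uncertainty."""
    return np.std(log_likelihoods)

u = positive_sample_disagreement(np.array([-1.2, -0.9, -3.4, -1.1]))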
Improving Neural Additive Models with Bayesian Principles
Bouchiat, Kouroche, Immer, Alexander, Yèche, Hugo, Rätsch, Gunnar, Fortuin, Vincent
Neural additive models (NAMs) can improve the interpretability of deep neural networks by handling input features in separate additive sub-networks. However, they lack inherent mechanisms that provide calibrated uncertainties and enable selection of relevant features and interactions. Approaching NAMs from a Bayesian perspective, we enhance them in three primary ways, namely by a) providing credible intervals for the individual additive sub-networks; b) estimating the marginal likelihood to perform an implicit selection of features via an empirical Bayes procedure; and c) enabling a ranking of feature pairs as candidates for second-order interaction in fine-tuned models. In particular, we develop Laplace-approximated NAMs (LA-NAMs), which show improved empirical performance on tabular datasets and challenging real-world medical tasks.
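For readers unfamiliar with NAMs, the additive structure can be sketched as one small sub-network per input feature whose outputs are summed; the Laplace-approximation machinery of LA-NAMs is omitted from this minimal skeleton, and the layer sizes are placeholders.

import torch
import torch.nn as nn

class NeuralAdditiveModel(nn.Module):
    """Minimal NAM skeleton: one small sub-network per input feature,
    predictions are the sum of per-feature contributions plus a bias."""
    def __init__(self, num_features, hidden=16):
        super().__init__()
        self.feature_nets = nn.ModuleList(
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_features)
        )
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):                     # x: (batch, num_features)
        contributions = [net(x[:, i:i + 1]) for i, net in enumerate(self.feature_nets)]
        return torch.stack(contributions, dim=0).sum(dim=0) + self.bias

model = NeuralAdditiveModel(num_features=5)
y_hat = model(torch.randn(8, 5))              # (8, 1)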
Learning Layer-wise Equivariances Automatically using Gradients
van der Ouderaa, Tycho F. A., Immer, Alexander, van der Wilk, Mark
Symmetries provide fixed hard constraints on the functions a network can represent, need to be specified in advance, and cannot be adapted. Our goal is to allow flexible symmetry constraints that can automatically be learned from data using gradients. Learning symmetry and associated weight connectivity structures from scratch is difficult for two reasons. First, it requires efficient and flexible parameterisations of layer-wise equivariances. Second, symmetries act as constraints and are therefore not encouraged by training losses measuring data fit. To overcome these challenges, we improve parameterisations of soft equivariance and learn the amount of equivariance in layers by optimising the marginal likelihood, estimated using differentiable Laplace approximations. The objective balances data fit and model complexity, enabling layer-wise symmetry discovery in deep networks. We demonstrate the ability to automatically learn layer-wise equivariances on image classification tasks, achieving equivalent or improved performance over baselines with hard-coded symmetry.
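The objective referred to here is, up to constants, the Laplace approximation of the log marginal likelihood; writing the symmetry (and other prior) hyperparameters as $\eta$, a MAP estimate as $\theta_\star$, and the Hessian of the negative log joint at $\theta_\star$ as $H_{\theta_\star}$ (in practice approximated, e.g. with K-FAC), it reads

\[
\log p(\mathcal{D} \mid \eta) \;\approx\; \log p(\mathcal{D} \mid \theta_\star) \;+\; \log p(\theta_\star \mid \eta) \;+\; \tfrac{P}{2}\log 2\pi \;-\; \tfrac{1}{2}\log\det H_{\theta_\star},
\]

where $P$ is the number of parameters; the data-fit and prior terms reward explaining the data, while the log-determinant term penalises model complexity.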
Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion
Meterez, Alexandru, Joudaki, Amir, Orabona, Francesco, Immer, Alexander, Rätsch, Gunnar, Daneshmand, Hadi
Normalization layers are one of the key building blocks for deep neural networks. Several theoretical studies have shown that batch normalization improves signal propagation by preventing the representations from becoming collinear across the layers. However, results from the mean-field theory of batch normalization also conclude that this benefit comes at the expense of exploding gradients in depth. Motivated by these two aspects of batch normalization, in this study we pose the following question: "Can a batch-normalized network keep the optimal signal propagation properties, but avoid exploding gradients?" We answer this question in the affirmative by giving a particular construction of a Multi-Layer Perceptron (MLP) with linear activations and batch normalization that provably has bounded gradients at any depth. Based on Weingarten calculus, we develop a rigorous and non-asymptotic theory for this constructed MLP that gives a precise characterization of forward signal propagation, while proving that gradients remain bounded for linearly independent input samples, which holds in most practical settings. Inspired by our theory, we also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
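The phenomenon in question can be reproduced with a generic batch-normalized linear MLP (this sketch only illustrates the setting; it is not the particular construction from the paper that provably bounds the gradients):

import torch
import torch.nn as nn

def bn_linear_mlp(width, depth):
    """Generic batch-normalized MLP with linear (identity) activations."""
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width, bias=False), nn.BatchNorm1d(width)]
    return nn.Sequential(*layers)

x = torch.randn(64, 32, requires_grad=True)   # linearly independent inputs (almost surely)
for depth in (10, 50, 100):
    net = bn_linear_mlp(width=32, depth=depth)
    out = net(x).pow(2).mean()
    grad, = torch.autograd.grad(out, x)
    print(depth, grad.norm().item())           # inspect how gradients grow with depth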