Bayesian Learning
Bayesian Low-rank Adaptation for Large Language Models
Yang, Adam X., Robeyns, Maxime, Wang, Xi, Aitchison, Laurence
Low-rank adaptation (LoRA) has emerged as a new paradigm for cost-efficient fine-tuning of large language models (LLMs). However, fine-tuned LLMs often become overconfident, especially when fine-tuned on small datasets. Bayesian methods, with their inherent ability to estimate uncertainty, serve as potent tools to mitigate overconfidence and enhance calibration. In this work, we introduce Laplace-LoRA, which applies a Bayesian approach to the LoRA parameters. Specifically, Laplace-LoRA applies a Laplace approximation to the posterior over the LoRA parameters, considerably improving the calibration of fine-tuned LLMs. In recent years, fine-tuning large language models (LLMs) has become increasingly important (Houlsby et al., 2019; Hu et al., 2021; Liu et al., 2022; Ding et al., 2022; 2023). Fine-tuning is used both to adapt LLMs for specific tasks and to create general instruction-following models (e.g. using Reinforcement Learning from Human Feedback, RLHF; Wei et al., 2021; Ouyang et al., 2022; Chung et al., 2022; Wang et al., 2022). However, fine-tuned LLMs have a notable limitation: they often exhibit overconfidence (Jiang et al., 2021; Xiao et al., 2022; He et al., 2023; Tian et al., 2023; OpenAI, 2023). This is particularly problematic in safety-critical applications or when making decisions in areas where limited data is available, such as medical diagnosis, finance and experimental design (Singhal et al., 2022; Wu et al., 2023; Lampinen et al., 2023; Li et al., 2022). Consequently, there is an urgent need for strategies that enhance the calibration of fine-tuned LLMs, ensuring that their predictions are as trustworthy as they are powerful. Bayesian deep learning is commonly proposed as a solution to overconfidence in deep networks. Historically, the field of Bayesian deep learning has frequently considered ResNets for image classification (Shridhar et al., 2019; Dusenberry et al., 2020; Izmailov et al., 2021).
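The idea of a Laplace approximation can be illustrated on a much smaller problem than an LLM. The sketch below is an assumption-laden toy, not the Laplace-LoRA implementation: it fits MAP weights for a logistic regression with a Gaussian prior, approximates the posterior by a Gaussian whose covariance is the inverse Hessian at the MAP point, and compares the MAP prediction with a probit-approximated Bayesian prediction. The function names, the prior precision and the synthetic data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_posterior_hessian(X, w, prior_precision):
    # Hessian of the negative log posterior: X^T diag(p(1-p)) X + prior_precision * I
    p = sigmoid(X @ w)
    return (X * (p * (1.0 - p))[:, None]).T @ X + prior_precision * np.eye(X.shape[1])

def fit_map(X, y, prior_precision=1.0, n_iter=50):
    # A fixed number of Newton steps is used for brevity (no convergence check).
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y) + prior_precision * w
        w = w - np.linalg.solve(neg_log_posterior_hessian(X, w, prior_precision), grad)
    return w

def laplace_covariance(X, w_map, prior_precision=1.0):
    # Laplace approximation: posterior ~ N(w_map, H^{-1}) with H evaluated at w_map.
    return np.linalg.inv(neg_log_posterior_hessian(X, w_map, prior_precision))

def predict_laplace(x, w_map, cov):
    # Probit approximation to the integral of sigmoid(x^T w) under the Gaussian posterior.
    mean = x @ w_map
    var = x @ cov @ x
    return sigmoid(mean / np.sqrt(1.0 + np.pi * var / 8.0))

# Hypothetical small dataset (20 points, 2 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=20) > 0).astype(float)

w_map = fit_map(X, y)
cov = laplace_covariance(X, w_map)

x_new = np.array([3.0, 0.0])           # a point far from most of the training data
p_map = sigmoid(x_new @ w_map)         # point-estimate prediction
p_laplace = predict_laplace(x_new, w_map, cov)
print(p_map, p_laplace)
```

Because the predictive variance is non-negative, the Laplace prediction in this sketch is never further from 0.5 than the MAP prediction, which is the sense in which the approximation tempers overconfidence.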
The last Dance : Robust backdoor attack via diffusion models and bayesian approach
Diffusion models are state-of-the-art deep learning generative models that are trained on the principle of learning forward and backward diffusion processes via the progressive addition of noise and denoising. In this paper, we seek to trick audio-based DNN models, in particular the transformer-based audio models available through the Hugging Face framework, which are powerful machine learning models valued for saving time and delivering faster, more efficient results. We demonstrate the feasibility of backdoor attacks (called `BacKBayDiffMod`) on audio transformers derived from Hugging Face, a popular framework in the world of artificial intelligence (AI) research. The backdoor attack developed in this paper is based on poisoning the model's training data by incorporating backdoor diffusion sampling and a Bayesian approach to the distribution of poisoned data.
Estimating the Local Learning Coefficient at Scale
The local learning coefficient (LLC) is a principled way of quantifying model complexity, originally derived in the context of Bayesian statistics using singular learning theory (SLT). Several methods are known for numerically estimating the local learning coefficient, but so far these methods have not been extended to the scale of modern deep learning architectures or data sets. Using a method developed in arXiv:2308.12108 [stat.ML], we empirically show how the LLC may be measured accurately and self-consistently for deep linear networks (DLNs) up to 100M parameters. We also show that the estimated LLC has the rescaling invariance that holds for the theoretical quantity.
Bayesian Factorised Granger-Causal Graphs For Multivariate Time-series Data
We study the problem of automatically discovering Granger causal relations from observational multivariate time-series data. Vector autoregressive (VAR) models have been time-tested for this problem, including Bayesian variants and more recent developments using deep neural networks. Most existing VAR methods for Granger causality use sparsity-inducing penalties/priors or post-hoc thresholds to interpret their coefficients as Granger causal graphs. Instead, we propose a new Bayesian VAR model with a hierarchical graph prior over binary Granger causal graphs, separately from the VAR coefficients. We develop an efficient algorithm to infer the posterior over binary Granger causal graphs. Our method provides better uncertainty quantification, has fewer hyperparameters, and achieves better performance than competing approaches, especially on sparse multivariate time-series data.
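For context, the post-hoc thresholding baseline that the abstract contrasts against can be sketched in a few lines. This is not the proposed hierarchical graph prior: it is a plain least-squares VAR(1) fit followed by a threshold on the coefficient magnitudes. The true coefficient matrix, noise level, series length and threshold value are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: A_true[i, j] != 0 means series j Granger-causes series i.
A_true = np.array([[0.5, 0.0, 0.0],
                   [0.4, 0.5, 0.0],
                   [0.0, 0.0, 0.5]])

# Simulate a stable VAR(1) process x[t] = A_true @ x[t-1] + noise.
T = 2000
x = np.zeros((T, 3))
for t in range(1, T):
    x[t] = A_true @ x[t - 1] + 0.1 * rng.normal(size=3)

# Least-squares fit: future ~= past @ B, so the VAR coefficient matrix is B.T.
past, future = x[:-1], x[1:]
B, *_ = np.linalg.lstsq(past, future, rcond=None)
A_hat = B.T

# Post-hoc threshold turns coefficients into a binary Granger causal graph.
threshold = 0.2
graph = (np.abs(A_hat) > threshold).astype(int)
print(graph)
```

The threshold is an extra hyperparameter with no uncertainty attached to the resulting edges, which is the limitation a posterior over binary graphs is meant to address.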
Improved prediction of future user activity in online A/B testing
Masoero, Lorenzo, Beraha, Mario, Richardson, Thomas, Favaro, Stefano
In online randomized experiments or A/B tests, accurate predictions of participant inclusion rates are of paramount importance. These predictions not only guide experimenters in optimizing the experiment's duration but also enhance the precision of treatment effect estimates. In this paper we present a novel, straightforward, and scalable Bayesian nonparametric approach for predicting the rate at which individuals will be exposed to interventions within the realm of online A/B testing. Our approach stands out by offering dual prediction capabilities: it forecasts both the quantity of new customers expected in future time windows and, unlike available alternative methods, the number of times they will be observed. We derive closed-form expressions for the posterior distributions of the quantities needed to form predictions about future user activity, thereby bypassing the need for numerical algorithms such as Markov chain Monte Carlo. After a comprehensive exposition of our model, we test its performance in experiments on real and simulated data, where we show its superior performance with respect to existing alternatives in the literature.
1 Introduction
The problem of predicting the size of a population from which random samples are drawn has a long history in the statistics literature. Originally motivated by applications in ecology, where the goal is typically to determine the number of distinct species of animals within a population (Fisher et al., 1943; Good, 1953; Burnham and Overton, 1979), a variation of this problem has recently received considerable attention also in the genomics literature, where scientists are interested in predicting the number of future rare variants to be observed within a genomic study (Ionita-Laza et al., 2009; Zou et al., 2016; Chakraborty et al., 2019; Masoero et al., 2022).
FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion
Han, Xing, Nguyen, Huy, Harris, Carl, Ho, Nhat, Saria, Suchi
As machine learning models in critical fields increasingly grapple with multimodal data, they face the dual challenges of handling a wide array of modalities, often incomplete due to missing elements, and the temporal irregularity and sparsity of collected samples. Successfully leveraging this complex data, while overcoming the scarcity of high-quality training samples, is key to improving these models' predictive performance. We introduce "FuseMoE", a mixture-of-experts framework incorporated with an innovative gating function. Designed to integrate a diverse number of modalities, FuseMoE is effective in managing scenarios with missing modalities and irregularly sampled data trajectories. Theoretically, our unique gating function contributes to enhanced convergence rates, leading to better performance in multiple downstream tasks. The practical utility of FuseMoE in the real world is validated by a challenging set of clinical risk prediction tasks.
The Matrix: A Bayesian learning model for LLMs
Dalal, Siddhartha, Misra, Vishal
In this paper, we introduce a Bayesian learning model to understand the behavior of Large Language Models (LLMs). We explore the optimization metric of LLMs, which is based on predicting the next token, and develop a novel model grounded in this principle. Our approach involves constructing an ideal generative text model represented by a multinomial transition probability matrix with a prior, and we examine how LLMs approximate this matrix. We discuss the continuity of the mapping between embeddings and multinomial distributions, and present the Dirichlet approximation theorem to approximate any prior. Additionally, we demonstrate how text generation by LLMs aligns with Bayesian learning principles and delve into the implications for in-context learning, specifically explaining why in-context learning emerges in larger models where prompts are considered as samples to be updated. Our findings indicate that the behavior of LLMs is consistent with Bayesian Learning, offering new insights into their functioning and potential applications.
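The picture of a multinomial transition matrix with a prior, updated by the tokens in a prompt, can be made concrete with a standard Dirichlet-multinomial update. The sketch below is only an illustration of that textbook conjugate update under assumed choices (a three-token vocabulary, first-order transitions, a uniform Dirichlet prior); it is not the authors' model, and the names `update` and `predictive` are hypothetical.

```python
import numpy as np

vocab = ["a", "b", "c"]
index = {tok: i for i, tok in enumerate(vocab)}

# Dirichlet prior pseudo-counts: one row per previous token, one column per next token.
alpha = np.ones((len(vocab), len(vocab)))

def update(prior, tokens):
    # Conjugate update: add one count for every observed (previous, next) transition.
    post = prior.copy()
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        post[index[prev], index[nxt]] += 1.0
    return post

def predictive(pseudo_counts):
    # Posterior predictive next-token distribution for each previous token.
    return pseudo_counts / pseudo_counts.sum(axis=1, keepdims=True)

prompt = list("abababab")          # the prompt is treated as observed samples
posterior = update(alpha, prompt)
P = predictive(posterior)
print(P[index["a"]])               # next-token probabilities after "a"
```

After this prompt, the row for "a" has pseudo-counts [1, 5, 1], so the predictive probability of "b" following "a" rises from 1/3 under the prior to 5/7, while the row for the unseen token "c" stays at the prior.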
Toward Green and Human-Like Artificial Intelligence: A Complete Survey on Contemporary Few-Shot Learning Approaches
Tsoumplekas, Georgios, Li, Vladislav, Argyriou, Vasileios, Lytos, Anastasios, Fountoukidis, Eleftherios, Goudos, Sotirios K., Moscholios, Ioannis D., Sarigiannidis, Panagiotis
Despite deep learning's widespread success, its data-hungry and computationally expensive nature makes it impractical for many data-constrained real-world applications. Few-Shot Learning (FSL) aims to address these limitations by enabling rapid adaptation to novel learning tasks, and it has seen significant growth in recent years. This survey provides a comprehensive overview of the field's latest advancements. Initially, FSL is formally defined, and its relationship with different learning fields is presented. A novel taxonomy is introduced, extending previously proposed ones, and real-world applications in classic and novel fields are described. Finally, recent trends shaping the field, outstanding challenges, and promising future research directions are discussed.
Diffusive Gibbs Sampling
Chen, Wenlin, Zhang, Mingtian, Paige, Brooks, Hernández-Lobato, José Miguel, Barber, David
The inadequate mixing of conventional Markov Chain Monte Carlo (MCMC) methods for multi-modal distributions presents a significant challenge in practical applications such as Bayesian inference and molecular dynamics. Addressing this, we propose Diffusive Gibbs Sampling (DiGS), an innovative family of sampling methods designed for effective sampling from distributions characterized by distant and disconnected modes. DiGS integrates recent developments in diffusion models, leveraging Gaussian convolution to create an auxiliary noisy distribution that bridges isolated modes in the original space and applying Gibbs sampling to alternately draw samples from both spaces. Our approach exhibits a better mixing property for sampling multi-modal distributions than state-of-the-art methods such as parallel tempering. We demonstrate that our sampler attains substantially improved results across various tasks, including mixtures of Gaussians, Bayesian neural networks and molecular dynamics.
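The alternation between the original space and a Gaussian-convolved auxiliary space can be sketched on a one-dimensional mixture of two Gaussians, where both Gibbs conditionals are available in closed form. This is a simplified illustration under stated assumptions, not the DiGS algorithm as published: the auxiliary variable is taken to be x_tilde ~ N(alpha * x, sigma^2), the denoising conditional is sampled exactly because the toy target is a Gaussian mixture, and the parameter values and function name are hypothetical.

```python
import numpy as np

def gibbs_with_noisy_auxiliary(n_steps, means, weights, s, alpha=1.0, sigma=5.0,
                               x0=0.0, seed=0):
    # Target: p(x) = sum_k weights[k] * N(x; means[k], s^2).
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Variance of x given x_tilde and a mixture component (same for every component).
    post_var = 1.0 / (1.0 / s**2 + alpha**2 / sigma**2)
    # Variance of x_tilde given a mixture component, with x integrated out.
    marg_var = alpha**2 * s**2 + sigma**2

    x = x0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        # Noising step: draw the auxiliary variable from the Gaussian convolution kernel.
        x_tilde = alpha * x + sigma * rng.normal()

        # Denoising step: p(x | x_tilde) is itself a Gaussian mixture.
        # Component responsibilities are proportional to weights[k] * N(x_tilde; alpha*means[k], marg_var).
        log_r = np.log(weights) - 0.5 * (x_tilde - alpha * means) ** 2 / marg_var
        r = np.exp(log_r - log_r.max())
        r /= r.sum()
        k = rng.choice(len(means), p=r)

        post_mean = post_var * (means[k] / s**2 + alpha * x_tilde / sigma**2)
        x = post_mean + np.sqrt(post_var) * rng.normal()
        samples[t] = x
    return samples

# Two well-separated modes at -5 and +5; the chain is started inside the right-hand mode.
samples = gibbs_with_noisy_auxiliary(5000, means=[-5.0, 5.0], weights=[0.5, 0.5],
                                     s=0.5, x0=5.0)
frac_right = np.mean(samples > 0)
print(frac_right)
```

With a noise scale comparable to the distance between the modes, the auxiliary variable frequently lands where both components have non-negligible responsibility, so the chain hops between modes instead of staying where it started.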
Graph Neural Machine: A New Model for Learning with Tabular Data
Nikolentzos, Giannis, Wang, Siyun, Lutzeyer, Johannes, Vazirgiannis, Michalis
In recent years, there has been a growing interest in mapping data from different domains to graph structures. Among others, neural network models such as the multi-layer perceptron (MLP) can be modeled as graphs. In fact, MLPs can be represented as directed acyclic graphs. Graph neural networks (GNNs) have recently become the standard tool for performing machine learning tasks on graphs. In this work, we show that an MLP is equivalent to an asynchronous message passing GNN model which operates on the MLP's graph representation. We then propose a new machine learning model for tabular data, the so-called Graph Neural Machine (GNM), which replaces the MLP's directed acyclic graph with a nearly complete graph and which employs a synchronous message passing scheme. We show that a single GNM model can simulate multiple MLP models. We evaluate the proposed model on several classification and regression datasets. In most cases, the GNM model outperforms the MLP architecture.
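The claim that an MLP can be viewed as message passing over a directed acyclic graph can be checked numerically on a tiny network. The sketch below is an illustrative construction under assumptions, not necessarily the paper's formal one: each neuron becomes a node, each weight becomes a weighted edge, and nodes are updated one at a time in topological order (the asynchronous schedule). Layer sizes, the ReLU activation and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                                   # input, hidden, output widths
Ws = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(2)]
bs = [rng.normal(size=sizes[l + 1]) for l in range(2)]

def relu(z):
    return np.maximum(z, 0.0)

def mlp_forward(x):
    # Standard layer-by-layer forward pass; no activation on the output layer.
    h = relu(Ws[0] @ x + bs[0])
    return Ws[1] @ h + bs[1]

# Build the DAG: nodes are numbered layer by layer, offsets[l] is the first node of layer l.
offsets = [0]
for width in sizes:
    offsets.append(offsets[-1] + width)

in_edges = {v: [] for v in range(offsets[-1])}      # v -> list of (source node, weight)
bias = {}
for l, (W, b) in enumerate(zip(Ws, bs)):
    for j in range(sizes[l + 1]):
        v = offsets[l + 1] + j
        bias[v] = b[j]
        for i in range(sizes[l]):
            in_edges[v].append((offsets[l] + i, W[j, i]))

def message_passing_forward(x):
    # Input nodes hold the features; every other node aggregates weighted messages
    # from its in-neighbours once those neighbours have been updated.
    state = {i: x[i] for i in range(sizes[0])}
    first_output = offsets[-2]
    for v in range(sizes[0], offsets[-1]):          # increasing index is a topological order
        agg = sum(w * state[u] for u, w in in_edges[v]) + bias[v]
        state[v] = agg if v >= first_output else relu(agg)
    return np.array([state[v] for v in range(first_output, offsets[-1])])

x = rng.normal(size=3)
out_mlp = mlp_forward(x)
out_graph = message_passing_forward(x)
print(np.allclose(out_mlp, out_graph))
```

Both routes compute the same weighted sums in the same dependency order, so the outputs agree up to floating-point rounding.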