Bayesian Inference
Blink of an eye: a simple theory for feature localization in generative models
Li, Marvin, Karan, Aayush, Chen, Sitan
Large language models (LLMs) can exhibit undesirable and unexpected behavior in the blink of an eye. In a recent Anthropic demo, Claude switched from coding to Googling pictures of Yellowstone, and these sudden shifts in behavior have also been observed in reasoning patterns and jailbreaks. This phenomenon is not unique to autoregressive models: in diffusion models, key features of the final output are decided in narrow "critical windows" of the generation process. In this work we develop a simple, unifying theory to explain this phenomenon. We show that it emerges generically as the generation process localizes to a sub-population of the distribution it models. While critical windows have been studied at length in diffusion models, existing theory relies heavily on strong distributional assumptions and the particulars of Gaussian diffusion. In contrast to existing work, our theory (1) applies to autoregressive and diffusion models; (2) makes no distributional assumptions; (3) quantitatively improves previous bounds even when specialized to diffusions; and (4) requires only basic tools and no stochastic calculus or statistical physics-based machinery. We also identify an intriguing connection to the all-or-nothing phenomenon from statistical inference. Finally, we validate our predictions empirically for LLMs and find that critical windows often coincide with failures in problem solving for various math and reasoning benchmarks.
Federated Generalised Variational Inference: A Robust Probabilistic Federated Learning Framework
Mildner, Terje, Hamelijnck, Oliver, Giampouras, Paris, Damoulas, Theodoros
We introduce FedGVI, a probabilistic Federated Learning (FL) framework that is provably robust to both prior and likelihood misspecification. FedGVI addresses limitations in both frequentist and Bayesian FL by providing unbiased predictions under model misspecification, with calibrated uncertainty quantification. Our approach generalises previous FL approaches, specifically Partitioned Variational Inference (Ashman et al., 2022), by allowing robust and conjugate updates, decreasing computational complexity at the clients. We offer theoretical analysis in terms of fixed-point convergence, optimality of the cavity distribution, and provable robustness. Additionally, we empirically demonstrate the effectiveness of FedGVI in terms of improved robustness and predictive performance on multiple synthetic and real world classification data sets.
Stochastic Linear Bandits with Latent Heterogeneity
Chen, Elynn, Chen, Xi, Jing, Wenbo, Liu, Xiao
This paper addresses the critical challenge of latent heterogeneity in online decision-making, where individual responses to business actions vary due to unobserved characteristics. While existing approaches in data-driven decision-making have focused on observable heterogeneity through contextual features, they fall short when heterogeneity stems from unobservable factors such as lifestyle preferences and personal experiences. We propose a novel latent heterogeneous bandit framework that explicitly models this unobserved heterogeneity in customer responses, with promotion targeting as our primary example. Our methodology introduces an innovative algorithm that simultaneously learns latent group memberships and group-specific reward functions. Through theoretical analysis and empirical validation using data from a mobile commerce platform, we establish high-probability bounds for parameter estimation, convergence rates for group classification, and comprehensive regret bounds. Notably, our theoretical analysis reveals two distinct types of regret measures: a "strong regret" against an oracle with perfect knowledge of customer memberships, which remains non-sub-linear due to inherent classification uncertainty, and a "regular regret" against an oracle aware only of deterministic components, for which our algorithm achieves a sub-linear rate that is minimax optimal in horizon length and dimension. We further demonstrate that existing bandit algorithms ignoring latent heterogeneity incur constant average regret that accumulates linearly over time. Our framework provides practitioners with new tools for decision-making under latent heterogeneity and extends to various business applications, including personalized pricing, resource allocation, and inventory management.
Model Successor Functions
Chang, Yingshan, Bisk, Yonatan
The notion of generalization has moved away from the classical one defined in statistical learning theory towards an emphasis on out-of-domain generalization (OODG). Recently, there has been a growing focus on inductive generalization, where a progression of difficulty implicitly governs the direction of domain shifts. In inductive generalization, it is often assumed that the training data lie on the easier side, while the testing data lie on the harder side. The challenge is that training data are always finite, but a learner is expected to infer an inductive principle that could be applied in an unbounded manner. This emerging regime has appeared in the literature under different names, such as length/logical/algorithmic extrapolation, but a formal definition is lacking. This work provides such a formalization, centered on the concept of model successors. We then outline directions for adapting well-established techniques to the learning of model successors. This work calls for restructuring the research discussion around inductive generalization from fragmented task-centric communities to a more unified effort, focused on universal properties of learning and computation.
Combining physics-based and data-driven models: advancing the frontiers of research with Scientific Machine Learning
Quarteroni, Alfio, Gervasio, Paola, Regazzoni, Francesco
Scientific Machine Learning (SciML) is a recently emerged research field which combines physics-based and data-driven models for the numerical approximation of differential problems. Physics-based models rely on the physical understanding of the problem at hand, subsequent mathematical formulation, and numerical approximation. Data-driven models instead aim to extract relations between input and output data without arguing any causality principle underlying the available data distribution. In recent years, data-driven models have been rapidly developed and popularized. Such a diffusion has been triggered by a huge availability of data (the so-called big data), increasingly cheap computing power, and the development of powerful machine learning algorithms. SciML leverages the physical awareness of physics-based models and, at the same time, the efficiency of data-driven algorithms. With SciML, we can inject physics and mathematical knowledge into machine learning algorithms. At the same time, we can rely on data-driven algorithms' capability to discover complex and non-linear patterns from data and improve the descriptive capacity of physics-based models. After recalling the mathematical foundations of digital modelling and machine learning algorithms, and presenting the most popular machine learning architectures, we discuss the great potential of a broad variety of SciML strategies in solving complex problems governed by partial differential equations. Finally, we illustrate the successful application of SciML to the simulation of the human cardiac function, a field of significant socio-economic importance that poses numerous challenges on both the mathematical and computational fronts. The corresponding mathematical model is a complex system of non-linear ordinary and partial differential equations describing the electromechanics, valve dynamics, blood circulation, perfusion in the coronary tree, and torso potential.
Despite the robustness and accuracy of physics-based models, certain aspects, such as unveiling constitutive laws for cardiac cells and myocardial material properties, as well as devising efficient reduced order models to dominate the extraordinary computational complexity, have been successfully tackled by leveraging data-driven models.
Beyond Prior Limits: Addressing Distribution Misalignment in Particle Filtering
Shi, Yiwei, Hu, Jingyu, Zhang, Yu, Yang, Mengyue, Zhang, Weinan, Liu, Cunjia, Liu, Weiru
Particle filtering is a Bayesian inference method and a fundamental tool in state estimation for dynamic systems, but its effectiveness is often limited by the constraints of the initial prior distribution, a phenomenon we define as the Prior Boundary Phenomenon. This challenge arises when target states lie outside the prior's support, rendering traditional particle filtering methods inadequate for accurate estimation. Although techniques like unbounded priors and larger particle sets have been proposed, they remain computationally prohibitive and lack adaptability in dynamic scenarios. To systematically overcome these limitations, we propose the Diffusion-Enhanced Particle Filtering Framework, which introduces three key innovations: adaptive diffusion through exploratory particles, entropy-driven regularisation to prevent weight collapse, and kernel-based perturbations for dynamic support expansion. These mechanisms collectively enable particle filtering to explore beyond prior boundaries, ensuring robust state estimation for out-of-boundary targets.
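The Prior Boundary Phenomenon can be reproduced in a few lines. The sketch below is a minimal illustration, not the proposed Diffusion-Enhanced Particle Filtering Framework: it runs a plain bootstrap filter on a static scalar state, then the same filter with Gaussian jitter added after resampling, which serves here as a simple stand-in for kernel-based perturbation. The target value, noise levels, particle count, and the function name `run_filter` are all assumptions chosen for illustration.

```python
import math
import random

def run_filter(observations, n_particles, jitter_sd, obs_sd, rng):
    """Bootstrap particle filter for a static scalar state (illustrative only)."""
    # Prior: Uniform(0, 1). The true state used below lies outside this support.
    particles = [rng.uniform(0.0, 1.0) for _ in range(n_particles)]
    for y in observations:
        # Gaussian log-likelihood of each particle, shifted by the max for stability.
        log_w = [-0.5 * ((y - x) / obs_sd) ** 2 for x in particles]
        top = max(log_w)
        weights = [math.exp(lw - top) for lw in log_w]
        # Multinomial resampling only copies existing particles.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        # Optional Gaussian jitter lets the particle cloud leave the prior support.
        if jitter_sd > 0.0:
            particles = [x + rng.gauss(0.0, jitter_sd) for x in particles]
    return particles

rng = random.Random(0)
true_state, obs_sd = 5.0, 0.5  # hypothetical target outside [0, 1]
observations = [true_state + rng.gauss(0.0, obs_sd) for _ in range(100)]

plain = run_filter(observations, 200, 0.0, obs_sd, rng)
jittered = run_filter(observations, 200, 0.3, obs_sd, rng)

plain_estimate = sum(plain) / len(plain)
jittered_estimate = sum(jittered) / len(jittered)
print(f"plain: {plain_estimate:.2f}, jittered: {jittered_estimate:.2f}")
```

Without jitter, resampling can only duplicate particles drawn from the prior, so the estimate cannot leave [0, 1] regardless of how many observations arrive. With jitter, the cloud can drift toward the target, at the cost of treating a static state as if it were slowly moving.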
BARNN: A Bayesian Autoregressive and Recurrent Neural Network
Coscia, Dario, Welling, Max, Demo, Nicola, Rozza, Gianluigi
Autoregressive and recurrent networks have achieved remarkable progress across various fields, from weather forecasting to molecular generation and Large Language Models. Despite their strong predictive capabilities, these models lack a rigorous framework for addressing uncertainty, which is key in scientific applications such as PDE solving, molecular generation and Machine Learning Force Fields. To address this shortcoming we present BARNN: a variational Bayesian Autoregressive and Recurrent Neural Network. BARNNs aim to provide a principled way to turn any autoregressive or recurrent model into its Bayesian version. BARNN is based on the variational dropout method, which allows it to be applied to large recurrent neural networks as well. We also introduce a temporal version of the "Variational Mixtures of Posteriors" prior (tVAMP-prior) to make Bayesian inference efficient and well-calibrated. Extensive experiments on PDE modelling and molecular generation demonstrate that BARNN not only achieves comparable or superior accuracy compared to existing methods, but also excels in uncertainty quantification and modelling long-range dependencies.
Joint Optimization of Prompt Security and System Performance in Edge-Cloud LLM Systems
Huang, Haiyang, Meng, Tianhui, Jia, Weijia
Large language models (LLMs) have significantly facilitated human life, and prompt engineering has improved the efficiency of these models. However, recent years have witnessed a rise in prompt engineering-empowered attacks, leading to issues such as privacy leaks, increased latency, and system resource wastage. Though safety fine-tuning methods based on Reinforcement Learning from Human Feedback (RLHF) have been proposed to align LLMs, existing security mechanisms fail to cope with fickle prompt attacks, highlighting the necessity of performing security detection on prompts. In this paper, we jointly consider prompt security, service latency, and system resource optimization in Edge-Cloud LLM (EC-LLM) systems under various prompt attacks. To enhance prompt security, a vector-database-enabled lightweight attack detector is proposed. We formalize the problem of joint prompt detection, latency, and resource optimization into a multi-stage dynamic Bayesian game model. The equilibrium strategy is determined by predicting the number of malicious tasks and updating beliefs at each stage through Bayesian updates. The proposed scheme is evaluated on a real, implemented EC-LLM system, and the results demonstrate that our approach offers enhanced security, reduces the service latency for benign users, and decreases system resource consumption compared to state-of-the-art algorithms.
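The stage-wise belief update mentioned above can be illustrated with a plain application of Bayes' rule. This is only a generic sketch, not the game model from the paper: the function name `update_belief`, the detector's true and false positive rates, and the initial belief are hypothetical values chosen for illustration.

```python
def update_belief(prior_malicious, flagged, true_positive_rate, false_positive_rate):
    """Posterior probability that a prompt source is malicious after one detector outcome."""
    if flagged:
        like_malicious, like_benign = true_positive_rate, false_positive_rate
    else:
        like_malicious = 1.0 - true_positive_rate
        like_benign = 1.0 - false_positive_rate
    # Total probability of the observed detector outcome.
    evidence = like_malicious * prior_malicious + like_benign * (1.0 - prior_malicious)
    return like_malicious * prior_malicious / evidence

# Hypothetical sequence of detector outcomes over three stages.
belief = 0.10
for flagged in [True, False, True]:
    belief = update_belief(belief, flagged, true_positive_rate=0.90, false_positive_rate=0.05)
```

Each stage's posterior becomes the next stage's prior, so a flagged prompt raises the belief and an unflagged one lowers it.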
Estimating the Probability of Sampling a Trained Neural Network at Random
[...] mass, under a Gaussian or uniform prior, of a region in neural network parameter space corresponding to a particular behavior, such as achieving test loss below some threshold. When the prior is uniform, this problem is equivalent to measuring the volume of a region. We show empirically and theoretically that existing algorithms for estimating volumes in parameter space underestimate the true volume by millions of orders of magnitude. We find that this error can be dramatically reduced, but not entirely eliminated, with an importance sampling method using gradient information that is already provided by popular optimizers. The negative logarithm of [...]
[...] They evaluate simple gradient-free learning algorithms, such as the "Guess & Check" optimizer which randomly samples parameters until it stumbles upon a network that achieves training loss under some threshold, and find that these methods have similar generalization behavior to gradient descent, at least on the very simple tasks they tested. Teney et al. (2024) find that randomly initialized networks represent very simple functions, which would explain the simplicity bias of deep learning if SGD behaves similarly to Guess & Check. Additionally, Mingard et al. (2021) provide evidence that SGD may be an approximate Bayesian sampler, where the prior distribution over functions is equal to the distribution over functions represented by randomly initialized networks.
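To make the two ideas in this passage concrete, the sketch below implements a Guess & Check loop and a naive Monte Carlo estimate of the prior mass of a low-loss region. It is a toy illustration under stated assumptions, not the importance sampling method from the paper: the two-parameter `loss`, the uniform prior on a square, the threshold, and all function names are hypothetical.

```python
import math
import random

def loss(theta):
    # Hypothetical toy loss: squared distance of the 2-D parameter vector from the origin.
    return theta[0] ** 2 + theta[1] ** 2

def sample_prior(rng):
    # Uniform prior over the square [-1, 1]^2.
    return (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))

def guess_and_check(threshold, rng, max_tries=100_000):
    """Draw parameters from the prior until the loss falls below the threshold."""
    for _ in range(max_tries):
        theta = sample_prior(rng)
        if loss(theta) < threshold:
            return theta
    raise RuntimeError("no sample under the threshold within max_tries")

def estimate_mass(threshold, n_samples, rng):
    """Naive Monte Carlo estimate of the prior mass of {theta : loss(theta) < threshold}."""
    hits = sum(loss(sample_prior(rng)) < threshold for _ in range(n_samples))
    return hits / n_samples

rng = random.Random(0)
threshold = 0.25                  # the low-loss region is a disk of radius 0.5
theta = guess_and_check(threshold, rng)
mass = estimate_mass(threshold, 20_000, rng)
exact = math.pi * 0.25 / 4.0      # disk area divided by the area of the square
```

Because the prior is uniform, the estimated mass is the region's volume as a fraction of the square. Direct sampling of this kind is only practical when the region has non-negligible mass; for regions that are vanishingly small under the prior, almost no samples land inside and the estimate degenerates.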
Bayesian Optimization with Preference Exploration by Monotonic Neural Network Ensemble
Wang, Hanyang, Branke, Juergen, Poloczek, Matthias
Many real-world black-box optimization problems have multiple conflicting objectives. Rather than attempting to approximate the entire set of Pareto-optimal solutions, interactive preference learning, i.e., optimization with a decision maker in the loop, allows to focus the search on the most relevant subset. However, few previous studies have exploited the fact that utility functions are usually monotonic.
In MOO, there is usually not a single optimal solution, but a range of so-called Pareto optimal or non-dominated solutions with different trade-offs. A widely adopted approach aims to search for a good representation of these Pareto-optimal solutions by maximizing their hypervolume. Two prominent methods stand out in this regard: ParEGO (Knowles, 2006), which employs random augmented Chebyshev scalarizations for optimization in each iteration, and expected hypervolume maximization (Yang et al., 2019; Daulton et al., 2020), which directly maximizes the hypervolume [...]
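A random augmented Chebyshev scalarization of the kind ParEGO uses can be sketched as follows. This is a minimal illustration that assumes all objectives are minimized and already scaled to comparable ranges. The weight-sampling scheme (normalized exponential draws), the default `rho = 0.05`, the candidate objective vectors, and the function names are assumptions made for the example and are not taken from the text above.

```python
import random

def augmented_chebyshev(objectives, weights, rho=0.05):
    """Augmented Chebyshev scalarization of objective values to be minimized."""
    weighted = [w * f for w, f in zip(weights, objectives)]
    # The max term targets the worst weighted objective; the small sum term breaks ties.
    return max(weighted) + rho * sum(weighted)

def random_weights(n_objectives, rng):
    """Draw a random weight vector on the simplex via normalized exponential draws."""
    raw = [rng.expovariate(1.0) for _ in range(n_objectives)]
    total = sum(raw)
    return [r / total for r in raw]

rng = random.Random(0)
# Hypothetical objective vectors of three mutually non-dominated candidates.
candidates = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1)]
weights = random_weights(2, rng)
best = min(candidates, key=lambda f: augmented_chebyshev(f, weights))
```

Redrawing `weights` in each iteration changes which trade-off the scalarized problem favours, so that repeated single-objective searches spread over different parts of the Pareto front.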