Bayesian Inference
A Bayesian Approach for Discovering Time- Delayed Differential Equation from Data
Chowdhury, Debangshu, Chakraborty, Souvik
Time-delayed differential equations (TDDEs) are widely used to model complex dynamic systems where future states depend on past states with a delay. However, inferring the underlying TDDEs from observed data remains a challenging problem due to the inherent nonlinearity, uncertainty, and noise in real-world systems. Conventional equation discovery methods often exhibit limitations when dealing with large time delays, relying on deterministic techniques or optimization-based approaches that may struggle with scalability and robustness. In this paper, we present BayTiDe - Bayesian Approach for Discovering Time-Delayed Differential Equations from Data, that is capable of identifying arbitrarily large values of time delay to an accuracy that is directly proportional to the resolution of the data input to it. BayTiDe leverages Bayesian inference combined with a sparsity-promoting discontinuous spike-and-slab prior to accurately identify time-delayed differential equations. The approach accommodates arbitrarily large time delays with accuracy proportional to the input data resolution, while efficiently narrowing the search space to achieve significant computational savings. We demonstrate the efficiency and robustness of BayTiDe through a range of numerical examples, validating its ability to recover delayed differential equations from noisy data.
A Sequential Optimal Learning Approach to Automated Prompt Engineering in Large Language Models
Wang, Shuyang, Moazeni, Somayeh, Klabjan, Diego
Designing effective prompts is essential to guiding large language models (LLMs) toward desired responses. Automated prompt engineering aims to reduce reliance on manual effort by streamlining the design, refinement, and optimization of natural language prompts. This paper proposes an optimal learning framework for automated prompt engineering, designed to sequentially identify effective prompt features while efficiently allocating a limited evaluation budget. We introduce a feature-based method to express prompts, which significantly broadens the search space. Bayesian regression is employed to utilize correlations among similar prompts, accelerating the learning process. To efficiently explore the large space of prompt features for a high quality prompt, we adopt the forward-looking Knowledge-Gradient (KG) policy for sequential optimal learning. The KG policy is computed efficiently by solving mixed-integer second-order cone optimization problems, making it scalable and capable of accommodating prompts characterized only through constraints. We demonstrate that our method significantly outperforms a set of benchmark strategies assessed on instruction induction tasks. The results highlight the advantages of using the KG policy for prompt learning given a limited evaluation budget. Our framework provides a solution to deploying automated prompt engineering in a wider range applications where prompt evaluation is costly.
Information Design with Unknown Prior
Classical information design models (e.g., Bayesian persuasion and cheap talk) require players to have perfect knowledge of the prior distribution of the state of the world. Our paper studies repeated persuasion problems in which the information designer does not know the prior. The information designer learns to design signaling schemes from repeated interactions with the receiver. We design learning algorithms for the information designer to achieve no regret compared to using the optimal signaling scheme with known prior, under two models of the receiver's decision-making. (1) The first model assumes that the receiver knows the prior and can perform posterior update and best respond to signals. In this model, we design a learning algorithm for the information designer with $O(\log T)$ regret in the general case, and another algorithm with $\Theta(\log \log T)$ regret in the case where the receiver has only two actions. (2) The second model assumes that the receiver does not know the prior and employs a no-regret learning algorithm to take actions. We show that the information designer can achieve regret $O(\sqrt{\mathrm{rReg}(T) T})$, where $\mathrm{rReg}(T)=o(T)$ is an upper bound on the receiver's learning regret. Our work thus provides a learning foundation for the problem of information design with unknown prior.
Re-examining Granger Causality from Causal Bayesian Networks Perspective
The emergence of machine learning (ML) has been phenomenal, with ML-based models outperforming human intelligence, as in the case of AlphaGo [1] and, more recently, large language models (LLMs). With these advances, ML became state-of-the-art for scientific discovery in various fields of study [2]. However, ML algorithms fail to answer the crucial question "what" brings about an effect and "what if" questions i.e., ML cannot identify causal relationships in data and counterfactual questions. Hence, the need for causality and causal inference a field that focuses on unravelling causal interactions in data. Characterising these interactions in complex dynamical systems is a fundamental question in science [3]. Causal structure learning (CSL)--a computational causal discovery field, taking advantage of statistics and machine learning (ML) to unravel causal relations in data--is particularly appealing because it enables us to answer counterfactual questions [4, 5, 6, 7]. We adopt Pearl's causality framework.
IRIS: A Bayesian Approach for Image Reconstruction in Radio Interferometry with expressive Score-Based priors
Dia, Noé, Yantovski-Barth, M. J., Adam, Alexandre, Bowles, Micah, Perreault-Levasseur, Laurence, Hezaveh, Yashar, Scaife, Anna
Inferring sky surface brightness distributions from noisy interferometric data in a principled statistical framework has been a key challenge in radio astronomy. In this work, we introduce Imaging for Radio Interferometry with Score-based models (IRIS). We use score-based models trained on optical images of galaxies as an expressive prior in combination with a Gaussian likelihood in the uv-space to infer images of protoplanetary disks from visibility data of the DSHARP survey conducted by ALMA. We demonstrate the advantages of this framework compared with traditional radio interferometry imaging algorithms, showing that it produces plausible posterior samples despite the use of a misspecified galaxy prior. Through coverage testing on simulations, we empirically evaluate the accuracy of this approach to generate calibrated posterior samples.
Transformers Simulate MLE for Sequence Generation in Bayesian Networks
Cao, Yuan, He, Yihan, Wu, Dennis, Chen, Hong-Yu, Fan, Jianqing, Liu, Han
Transformers (Vaswani et al. 2017) have achieved tremendous success across various fields. These models are known to be particularly strong in terms of sequence generation, and have revolutionized the way we approach problems related to text generation, translation, and scientific discoveries such as protein generation. Despite these achievements, there remains limited understanding of the theoretical capabilities of transformers as sequence generators. To theoretically understand how transformers efficiently generate sequences, several recent works have studied the the power of transformers in learning specific probability models for sequential data (Ildiz et al. 2024, Rajaraman et al. 2024, Makkuva et al. 2024, Nichani et al. 2024). Specifically, Ildiz et al. (2024) studied the problem of learning Markov chains with a one-layer self-attention model, and developed identifiability and convergence guarantees under certain conditions. Rajaraman et al. (2024) studied the behavior of transformers on data drawn from k-order Markov processes, where the conditional distribution of the next variable in a sequence depends on the previous k variables, and showed that such processes can be learned well by transformers of a constant-order depth. Makkuva et al. (2024) further studied the loss function landscape of one-layer transformers in learning Markov chains. Nichani et al. (2024) studied a setting where the tokens consist of multiple sequences of samples generated from a causal network, and demonstrated that transformers can be trained to learn the causal network structure so that, when seeing a new context-query pair, it can generate prediction according to the learned causal structure and the context. However, similar to the studies of Markov chains, Nichani et al. (2024) mostly focused on the setting where each variable has at most one parent.
Learning when to rank: Estimation of partial rankings from sparse, noisy comparisons
Morel-Balbi, Sebastian, Kirkley, Alec
A common task arising in various domains is that of ranking items based on the outcomes of pairwise comparisons, from ranking players and teams in sports to ranking products or brands in marketing studies and recommendation systems. Statistical inference-based methods such as the Bradley-Terry model, which extract rankings based on an underlying generative model of the comparison outcomes, have emerged as flexible and powerful tools to tackle the task of ranking in empirical data. In situations with limited and/or noisy comparisons, it is often challenging to confidently distinguish the performance of different items based on the evidence available in the data. However, existing inference-based ranking methods overwhelmingly choose to assign each item to a unique rank or score, suggesting a meaningful distinction when there is none. Here, we address this problem by developing a principled Bayesian methodology for learning partial rankings -- rankings with ties -- that distinguishes among the ranks of different items only when there is sufficient evidence available in the data. Our framework is adaptable to any statistical ranking method in which the outcomes of pairwise observations depend on the ranks or scores of the items being compared. We develop a fast agglomerative algorithm to perform Maximum A Posteriori (MAP) inference of partial rankings under our framework and examine the performance of our method on a variety of real and synthetic network datasets, finding that it frequently gives a more parsimonious summary of the data than traditional ranking, particularly when observations are sparse.
Reweighting Improves Conditional Risk Bounds
Zhang, Yikai, Lin, Jiahe, Li, Fengpei, Zheng, Songzhu, Raj, Anant, Schneider, Anderson, Nevmyvaka, Yuriy
In this work, we study the weighted empirical risk minimization (weighted ERM) schema, in which an additional data-dependent weight function is incorporated when the empirical risk function is being minimized. We show that under a general ``balanceable" Bernstein condition, one can design a weighted ERM estimator to achieve superior performance in certain sub-regions over the one obtained from standard ERM, and the superiority manifests itself through a data-dependent constant term in the error bound. These sub-regions correspond to large-margin ones in classification settings and low-variance ones in heteroscedastic regression settings, respectively. Our findings are supported by evidence from synthetic data experiments.
Multi-View Majority Vote Learning Algorithms: Direct Minimization of PAC-Bayesian Bounds
Hennequin, Mehdi, Zitouni, Abdelkrim, Benabdeslem, Khalid, Elghazel, Haytham, Gaci, Yacine
The PAC-Bayesian framework has significantly advanced the understanding of statistical learning, particularly for majority voting methods. Despite its successes, its application to multi-view learning -- a setting with multiple complementary data representations -- remains underexplored. In this work, we extend PAC-Bayesian theory to multi-view learning, introducing novel generalization bounds based on R\'enyi divergence. These bounds provide an alternative to traditional Kullback-Leibler divergence-based counterparts, leveraging the flexibility of R\'enyi divergence. Furthermore, we propose first- and second-order oracle PAC-Bayesian bounds and extend the C-bound to multi-view settings. To bridge theory and practice, we design efficient self-bounding optimization algorithms that align with our theoretical results.
Summarizing Bayesian Nonparametric Mixture Posterior -- Sliced Optimal Transport Metrics for Gaussian Mixtures
Existing methods to summarize posterior inference for mixture models focus on identifying a point estimate of the implied random partition for clustering, with density estimation as a secondary goal (Wade and Ghahramani, 2018; Dahl et al., 2022). We propose a novel approach for summarizing posterior inference in nonparametric Bayesian mixture models, prioritizing density estimation of the mixing measure (or mixture) as an inference target. One of the key features is the model-agnostic nature of the approach, which remains valid under arbitrarily complex dependence structures in the underlying sampling model. Using a decision-theoretic framework, our method identifies a point estimate by minimizing posterior expected loss. A loss function is defined as a discrepancy between mixing measures. Estimating the mixing measure implies inference on the mixture density and the random partition. Exploiting the discrete nature of the mixing measure, we use a version of sliced Wasserstein distance. We introduce two specific variants for Gaussian mixtures. The first, mixed sliced Wasserstein, applies generalized geodesic projections on the product of the Euclidean space and the manifold of symmetric positive definite matrices. The second, sliced mixture Wasserstein, leverages the linearity of Gaussian mixture measures for efficient projection.