Bayesian Inference
Continuously Augmented Discrete Diffusion model for Categorical Generative Modeling
Zheng, Huangjie, Gong, Shansan, Zhang, Ruixiang, Chen, Tianrong, Gu, Jiatao, Zhou, Mingyuan, Jaitly, Navdeep, Zhang, Yizhe
Standard discrete diffusion models treat all unobserved states identically by mapping them to an absorbing [MASK] token. This creates an 'information void' where semantic information that could be inferred from unmasked tokens is lost between denoising steps. We introduce Continuously Augmented Discrete Diffusion (CADD), a framework that augments the discrete state space with a paired diffusion in a continuous latent space. This yields graded, gradually corrupted states in which masked tokens are represented by noisy yet informative latent vectors rather than collapsed 'information voids'. At each reverse step, CADD may leverage the continuous latent as a semantic hint to guide discrete denoising. The design is clean and compatible with existing discrete diffusion training. At sampling time, the strength and choice of estimator for the continuous latent vector enables a controlled trade-off between mode-coverage (generating diverse outputs) and mode-seeking (generating contextually precise outputs) behaviors. Empirically, we demonstrate CADD improves generative quality over mask-based diffusion across text generation, image synthesis, and code modeling, with consistent gains on both qualitative and quantitative metrics against strong discrete baselines.
Differential Information Distribution: A Bayesian Perspective on Direct Preference Optimization
Won, Yunjae, Lee, Hyunji, Hwang, Hyeonbin, Seo, Minjoon
Direct Preference Optimization (DPO) has been widely used for aligning language models with human preferences in a supervised manner. However, several key questions remain unresolved: the rationale behind its log-ratio reward, how the statistical structure of preference datasets shapes its training dynamics, and how those dynamics impact downstream capabilities. We approach these questions from a Bayesian perspective, interpreting the goal of preference optimization as learning the differential information required to update a reference policy into a target policy. To formalize this view, we introduce the Differential Information Distribution (DID), defined as the distribution over samples that carry the Bayesian evidence required to update policies. We introduce three complementary insights by viewing preference optimization through the DID. First, we find that DPO's log-ratio reward is uniquely justified when preferences encode the Differential Information needed to update a reference policy into the target policy. Second, we discuss how commonly observed training dynamics in DPO, including changes in log-likelihood and policy exploration, stem from a power-law DID relationship. Finally, we analyze how training dynamics influence downstream performance using the entropy of DID, a principled measure of uncertainty in the learned information. We observe that learning high-entropy DID improves open-ended instruction-following, while low-entropy DID benefits knowledge-intensive QA. Taken together, our results show that DPO's reward design, training dynamics, and downstream capabilities all emerge as natural consequences of learning Differential Information, offering both a principled theoretical foundation and practical guidance for preference-based alignment.
Large-Scale Bayesian Causal Discovery with Interventional Data
Han, Seong Woo, Vo, Daniel Duy, Brown, Brielin C.
Inferring the causal relationships among a set of variables in the form of a directed acyclic graph (DAG) is an important but notoriously challenging problem. Recently, advancements in high-throughput genomic perturbation screens have inspired development of methods that leverage interventional data to improve model identification. However, existing methods still suffer poor performance on large-scale tasks and fail to quantify uncertainty. Here, we propose Interventional Bayesian Causal Discovery (IBCD), an empirical Bayesian framework for causal discovery with interventional data. Our approach models the likelihood of the matrix of total causal effects, which can be approximated by a matrix normal distribution, rather than the full data matrix. We place a spike-and-slab horseshoe prior on the edges and separately learn data-driven weights for scale-free and Erdลs-Rรฉnyi structures from observational data, treating each edge as a latent variable to enable uncertainty-aware inference. Through extensive simulation, we show that IBCD achieves superior structure recovery compared to existing baselines. We apply IBCD to CRISPR perturbation (Perturb-seq) data on 521 genes, demonstrating that edge posterior inclusion probabilities enable identification of robust graph structures.
Robust Classification of Oral Cancer with Limited Training Data
Sonawane, Akshay Bhagwan, Swamikannan, Lena D., Tamil, Lakshman
Oral cancer ranks among the most prevalent cancers globally, with a particularly high mortality rate in regions lacking adequate healthcare access. Early diagnosis is crucial for reducing mortality; however, challenges persist due to limited oral health programs, inadequate infrastructure, and a shortage of healthcare practitioners. Conventional deep learning models, while promising, often rely on point estimates, leading to overconfidence and reduced reliability. Critically, these models require large datasets to mitigate overfitting and ensure generalizability, an unrealistic demand in settings with limited training data. To address these issues, we propose a hybrid model that combines a convolutional neural network (CNN) with Bayesian deep learning for oral cancer classification using small training sets. This approach employs variational inference to enhance reliability through uncertainty quantification. The model was trained on photographic color images captured by smartphones and evaluated on three distinct test datasets. The proposed method achieved 94% accuracy on a test dataset with a distribution similar to that of the training data, comparable to traditional CNN performance. Notably, for real-world photographic image data, despite limitations and variations differing from the training dataset, the proposed model demonstrated superior generalizability, achieving 88% accuracy on diverse datasets compared to 72.94% for traditional CNNs, even with a smaller dataset. Confidence analysis revealed that the model exhibits low uncertainty (high confidence) for correctly classified samples and high uncertainty (low confidence) for misclassified samples. These results underscore the effectiveness of Bayesian inference in data-scarce environments in enhancing early oral cancer diagnosis by improving model reliability and generalizability.
Modeling Others' Minds as Code
Jha, Kunal, Huang, Aydan Yuenan, Ye, Eric, Jaques, Natasha, Kleiman-Weiner, Max
Accurate prediction of human behavior is essential for robust and safe human-AI collaboration. However, existing approaches for modeling people are often data-hungry and brittle because they either make unrealistic assumptions about rationality or are too computationally demanding to adapt rapidly. Our key insight is that many everyday social interactions may follow predictable patterns; efficient "scripts" that minimize cognitive load for actors and observers, e.g., "wait for the green light, then go." We propose modeling these routines as behavioral programs instantiated in computer code rather than policies conditioned on beliefs and desires. We introduce ROTE, a novel algorithm that leverages both large language models (LLMs) for synthesizing a hypothesis space of behavioral programs, and probabilistic inference for reasoning about uncertainty over that space. We test ROTE in a suite of gridworld tasks and a large-scale embodied household simulator. ROTE predicts human and AI behaviors from sparse observations, outperforming competitive baselines -- including behavior cloning and LLM-based methods -- by as much as 50% in terms of in-sample accuracy and out-of-sample generalization. By treating action understanding as a program synthesis problem, ROTE opens a path for AI systems to efficiently and effectively predict human behavior in the real-world.
A Proofs of Propositions Lemma 4 Let
Equation 9. Therefore if we define a standard "policy" loss L This is the "soft" version of an analogous statement made for "hard" optimality first shown in [32]. This argument is the direct counterpart to Theorem 2 in [32]--which uses argmax instead of softmax. From this point onwards, the same strategy for Proposition 2 again applies, completing the proof. Environments used for experiments are from OpenAI gym [56]. Each environment is associated with a true reward function (unknown to all imitation algorithms).