 Bayesian Inference


Nonparametric inference under shape constraints: past, present and future

arXiv.org Machine Learning

We survey the field of nonparametric inference under shape constraints, providing a historical overview and a perspective on its current state. An outlook and some open problems offer thoughts on future directions.

1 Introduction

Traditionally, we think of statistical methods as being divided into parametric approaches, which can be restrictive, but where estimation is typically straightforward (e.g. using maximum likelihood), and nonparametric methods, which are more flexible but often require careful choices of tuning parameters. Nonparametric inference under shape constraints sits somewhere in the middle, seeking in some ways the best of both worlds. The origins of the field are often traced to Grenander (1956), who proved that there exists a unique maximum likelihood estimator (MLE) of a decreasing density on the non-negative half-line (and was able to characterise it explicitly).
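The explicit characterisation mentioned above is the classical one: the MLE of a decreasing density is the left derivative of the least concave majorant of the empirical distribution function. A minimal sketch of that construction follows, assuming strictly positive samples; the function name `grenander` and its return format are illustrative choices, not taken from the survey.

```python
def grenander(samples):
    """Grenander estimator sketch for strictly positive samples.

    Returns (knots, heights): the estimate is the step function equal to
    heights[k] on the interval (knots[k], knots[k + 1]].
    """
    xs = sorted(samples)
    n = len(xs)
    # Points of the empirical distribution function, starting at the origin.
    pts = [(0.0, 0.0)] + [(x, (i + 1) / n) for i, x in enumerate(xs)]

    hull = []  # vertices of the least concave majorant (upper hull)
    for p in pts:
        # Drop the last vertex while it lies on or below the chord
        # from the vertex before it to the new point p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (y2 - y1) * (p[0] - x1) <= (p[1] - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(p)

    knots = [x for x, _ in hull]
    # Slopes of the majorant between consecutive vertices are the density values.
    heights = [
        (hull[k + 1][1] - hull[k][1]) / (hull[k + 1][0] - hull[k][0])
        for k in range(len(hull) - 1)
    ]
    return knots, heights


knots, heights = grenander([1.0, 3.0, 4.0])
```

No tuning parameter appears anywhere in the construction, which is the sense in which shape-constrained estimation keeps some of the convenience of parametric maximum likelihood.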


Reconcile Certified Robustness and Accuracy for DNN-based Smoothed Majority Vote Classifier

arXiv.org Artificial Intelligence

Within the PAC-Bayesian framework, the Gibbs classifier (defined on a posterior $Q$) and the corresponding $Q$-weighted majority vote classifier are commonly used to analyze the generalization performance. However, there is a notable lack of theoretical research exploring the certified robustness of the majority vote classifier and its interplay with generalization. In this study, we develop a generalization error bound that possesses a certified robust radius for the smoothed majority vote classifier (i.e., the $Q$-weighted majority vote classifier with smoothed inputs); in other words, the generalization bound holds under any data perturbation within the certified robust radius. As a byproduct, we find that the underpinnings of both the generalization bound and the certified robust radius draw, in part, upon the weight spectral norm, which thereby inspires the adoption of spectral regularization in smooth training to boost certified robustness. Utilizing the dimension-independent property of spherical Gaussian inputs in smooth training, we propose a novel and inexpensive spectral regularizer to enhance the smoothed majority vote classifier. In addition to the theoretical contribution, a set of empirical results is provided to substantiate the effectiveness of our proposed method.


RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance

arXiv.org Artificial Intelligence

Diffusion large language models (dLLMs) have shown great potential in large-scale language modeling, and there is an increasing interest in further improving the capacity to solve complex problems by guiding the reasoning process step by step. Common practice for autoregressive language models is to learn a process reward model with dense annotations for each intermediate step. However, this is challenging for dLLMs, where generation proceeds in an any-order fashion and intermediate states are partially masked sentences. To this end, in this paper, we propose reward-free guidance (RFG), a principled method for guiding the reasoning trajectory of dLLMs without explicit process reward. The key idea of RFG is to parameterize the process reward by log-likelihood ratios of the enhanced and reference dLLMs, where the enhanced model can be easily obtained by any off-the-shelf dLLM that has been post-trained with reinforcement learning (RL) or supervised fine-tuning (SFT). We provide theoretical justification that RFG induces the reward-guided sampling distribution with no additional reward. We conduct comprehensive experiments on four challenging mathematical reasoning and code generation benchmarks using a diverse suite of dLLMs enhanced with various post-training methods. RFG consistently yields significant improvements across all tasks and model types, achieving accuracy gains of up to 9.2%. These findings establish RFG as a general training-free framework that scales test-time reasoning without reliance on external reward models.

By scaling up mask-predict pretraining on large-scale corpora through bidirectional computation, dLLMs have shown surprisingly competitive or even superior performance over autoregressive (AR) model baselines (Prabhudesai et al., 2025). Despite the impressive advancements, the current success of dLLMs is primarily limited to pre-training or continued training on a specific domain, with limited exploration in test-time computation and alignment.
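To make the log-likelihood-ratio idea concrete, here is a minimal sketch of how an implicit reward of the form log p_enh - log p_ref could tilt a reference model's token distribution at one masked position. The guidance weight `w`, the function names, and the per-position combination rule are assumptions made for illustration; the abstract does not state the exact sampling rule used by RFG.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def guided_log_probs(ref_logits, enh_logits, w):
    """Tilt the reference distribution by w times the log-likelihood ratio.

    Assumed rule (hypothetical): log p_guided = log p_ref + w * (log p_enh - log p_ref),
    renormalised. w = 0 recovers the reference model, w = 1 the enhanced model,
    and w > 1 extrapolates beyond the enhanced model.
    """
    ref = log_softmax(ref_logits)
    enh = log_softmax(enh_logits)
    combined = [r + w * (e - r) for r, e in zip(ref, enh)]
    return log_softmax(combined)


# Toy vocabulary of three tokens at a single masked position.
ref_logits = [0.0, 0.0, 0.0]
enh_logits = [2.0, 0.0, 0.0]
guided = guided_log_probs(ref_logits, enh_logits, w=2.0)
```

Under this assumed rule no separate reward model is evaluated: the only inputs are the logits of two existing dLLMs, which matches the training-free framing of the abstract.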


Multi-Task Equation Discovery

arXiv.org Artificial Intelligence

Equation discovery provides a grey-box approach to system identification by uncovering governing dynamics directly from observed data. However, a persistent challenge lies in ensuring that identified models generalise across operating conditions rather than over-fitting to specific datasets. This work investigates this issue by applying a Bayesian relevance vector machine (RVM) within a multi-task learning (MTL) framework for simultaneous parameter identification across multiple datasets. In this formulation, responses from the same structure under different excitation levels are treated as related tasks that share model parameters but retain task-specific noise characteristics. A simulated single degree-of-freedom oscillator with linear and cubic stiffness provided the case study, with datasets generated under three excitation regimes. Standard single-task RVM models were able to reproduce system responses but often failed to recover the true governing terms when excitations insufficiently stimulated non-linear dynamics. By contrast, the MTL-RVM combined information across tasks, improving parameter recovery for weakly and moderately excited datasets, while maintaining strong performance under high excitation. These findings demonstrate that multi-task Bayesian inference can mitigate over-fitting and promote generalisation in equation discovery. The approach is particularly relevant to structural health monitoring, where varying load conditions reveal complementary aspects of system physics.


Learning to Condition: A Neural Heuristic for Scalable MPE Inference

arXiv.org Artificial Intelligence

We introduce learning to condition (L2C), a scalable, data-driven framework for accelerating Most Probable Explanation (MPE) inference in Probabilistic Graphical Models (PGMs), a fundamentally intractable problem. L2C trains a neural network to score variable-value assignments based on their utility for conditioning, given observed evidence. To facilitate supervised learning, we develop a scalable data generation pipeline that extracts training signals from the search traces of existing MPE solvers. The trained network serves as a heuristic that integrates with search algorithms, acting as a conditioning strategy prior to exact inference or as a branching and node selection policy within branch-and-bound solvers. We evaluate L2C on challenging MPE queries involving high-treewidth PGMs. Experiments show that our learned heuristic significantly reduces the search space while maintaining or improving solution quality over state-of-the-art methods.


From Fragile to Certified: Wasserstein Audits of Group Fairness Under Distribution Shift

arXiv.org Artificial Intelligence

Group-fairness metrics (e.g., equalized odds) can vary sharply across resamples and are especially brittle under distribution shift, undermining reliable audits. We propose a Wasserstein distributionally robust framework that certifies worst-case group fairness over a ball of plausible test distributions centered at the empirical law. Our formulation unifies common group fairness notions via a generic conditional-probability functional and defines $\varepsilon$-Wasserstein Distributional Fairness ($\varepsilon$-WDF) as the audit target. Leveraging strong duality, we derive tractable reformulations and an efficient estimator (DRUNE) for $\varepsilon$-WDF. We prove feasibility and consistency and establish finite-sample certification guarantees for auditing fairness, along with quantitative bounds under smoothness and margin conditions. Across standard benchmarks and classifiers, $\varepsilon$-WDF delivers stable fairness assessments under distribution shift, providing a principled basis for auditing and certifying group fairness beyond observational data.


A Unified Probabilistic Framework for Dictionary Learning with Parsimonious Activation

arXiv.org Artificial Intelligence

Dictionary learning is traditionally formulated as an $L_1$-regularized signal reconstruction problem. While recent developments have incorporated discriminative, hierarchical, or generative structures, most approaches encourage sparsity in the representation of each individual sample and overlook how atoms are shared across samples, resulting in redundant and sub-optimal dictionaries. We introduce a parsimony-promoting regularizer based on the row-wise $L_\infty$ norm of the coefficient matrix. This additional penalty encourages entire rows of the coefficient matrix to vanish, thereby reducing the number of dictionary atoms activated across the dataset. We derive the formulation from a probabilistic model with Beta-Bernoulli priors, which provides a Bayesian interpretation linking the regularization parameters to prior distributions. We further establish a theoretical calculation for optimal hyperparameter selection and connect our formulation to Minimum Description Length, Bayesian model selection, and pathlet learning. Extensive experiments on benchmark datasets demonstrate that our method achieves substantially improved reconstruction quality (with a 20% reduction in RMSE) and enhanced representation sparsity, utilizing fewer than one-tenth of the available dictionary atoms, while empirically validating our theoretical analysis.
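The row-wise $L_\infty$ penalty described above can be sketched in a few lines. The sketch assumes the coefficient matrix is stored with one row per dictionary atom and one column per sample; the function names and the tolerance are illustrative, not taken from the paper.

```python
def rowwise_linf_penalty(A):
    """Sum over rows (atoms) of the largest absolute coefficient in that row."""
    return sum(max(abs(a) for a in row) for row in A)


def l1_penalty(A):
    """Entry-wise L1 norm, the usual per-sample sparsity term."""
    return sum(abs(a) for row in A for a in row)


def active_atoms(A, tol=1e-12):
    """Number of atoms used by at least one sample (rows that are not all zero)."""
    return sum(1 for row in A if max(abs(a) for a in row) > tol)


# Three atoms (rows), three samples (columns); the first atom is never used.
A = [
    [0.0, 0.0, 0.0],
    [1.0, -3.0, 2.0],
    [0.5, 0.0, 0.0],
]
penalty = rowwise_linf_penalty(A)  # 0 + 3 + 0.5
used = active_atoms(A)
```

A row contributes nothing to the penalty only when every coefficient in it is zero, so shrinking the penalty pushes whole atoms out of use, whereas the $L_1$ term alone would count the last two rows by their individual entries.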


Reciprocally Coupled Local Estimators Implement Bayesian Information Integration Distributively

Neural Information Processing Systems

Psychophysical experiments have demonstrated that the brain integrates information from multiple sensory cues in a near Bayesian optimal manner. The present study proposes a novel mechanism to achieve this. We consider two reciprocally connected networks, mimicking the integration of heading direction information between the dorsal medial superior temporal (MSTd) and the ventral intraparietal (VIP) areas. Each network serves as a local estimator and receives an independent cue, either the visual or the vestibular, as direct input for the external stimulus. We find that positive reciprocal interactions can improve the decoding accuracy of each individual network as if it implements Bayesian inference from two cues. Our model successfully explains the experimental finding that both MSTd and VIP achieve Bayesian multisensory integration, though each of them only receives a single cue as direct external input. Our result suggests that the brain may implement optimal information integration distributively at each local estimator through the reciprocal connections between cortical regions.
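For reference, the Bayesian-optimal benchmark against which such integration is usually measured can be written down directly. The sketch below assumes two independent cues with Gaussian likelihoods and a flat prior, and it treats heading direction as a linear rather than circular variable; these simplifications and the function name are assumptions for illustration, not the network model of the paper.

```python
def integrate_cues(mu1, var1, mu2, var2):
    """Posterior mean and variance for two independent Gaussian cues, flat prior.

    Each cue is weighted by its reliability (inverse variance).
    """
    precision1 = 1.0 / var1
    precision2 = 1.0 / var2
    w1 = precision1 / (precision1 + precision2)
    mean = w1 * mu1 + (1.0 - w1) * mu2
    variance = 1.0 / (precision1 + precision2)
    return mean, variance


# Visual cue says 10 degrees, vestibular cue says 20 degrees, equal reliability.
mean, variance = integrate_cues(10.0, 4.0, 20.0, 4.0)
```

The combined variance is always smaller than either single-cue variance, which is the behavioural signature that a local estimator would need to reproduce in order to look as if it had access to both cues.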


Bayesian inference as iterated random functions with applications to sequential inference in graphical models

Neural Information Processing Systems

We propose a general formalism of iterated random functions with semigroup property, under which exact and approximate Bayesian posterior updates can be viewed as specific instances. A convergence theory for iterated random functions is presented. As an application of the general theory we analyze convergence behaviors of exact and approximate message-passing algorithms that arise in a sequential change point detection problem formulated via a latent variable directed graphical model. The sequential inference algorithm and its supporting theory are illustrated by simulated examples.


Analyzing Hogwild Parallel Gaussian Gibbs Sampling

Neural Information Processing Systems

Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. Empirical work has shown that some models can be sampled effectively by going "Hogwild" and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra.
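A minimal sketch of the idea, under simplifying assumptions: the target is a Gaussian with density proportional to exp(-x'Jx/2 + h'x), and every coordinate is resampled simultaneously from stale values, which is the fully synchronous extreme of parallel updating rather than the block scheme with periodic communication. The function name, the 2x2 example, and the choice to report only the running mean are illustrative.

```python
import random


def hogwild_gaussian_gibbs(J, h, sweeps, seed=0):
    """Run fully parallel Gibbs sweeps and return the running mean of the samples.

    J is a symmetric precision matrix (list of lists), h the potential vector.
    In each sweep, all coordinates are drawn from their conditionals given the
    values of the previous sweep, not the freshly updated ones.
    """
    rng = random.Random(seed)
    n = len(h)
    x = [0.0] * n
    totals = [0.0] * n
    for _ in range(sweeps):
        old = x[:]  # stale copy shared by all "parallel" updates
        for i in range(n):
            off_diag = sum(J[i][j] * old[j] for j in range(n) if j != i)
            cond_mean = (h[i] - off_diag) / J[i][i]
            cond_std = (1.0 / J[i][i]) ** 0.5
            x[i] = rng.gauss(cond_mean, cond_std)
        totals = [t + xi for t, xi in zip(totals, x)]
    return [t / sweeps for t in totals]


# Diagonally dominant precision matrix; the exact mean solves J mu = h, so mu = [0.4, 0.4].
J = [[2.0, 0.5], [0.5, 2.0]]
h = [1.0, 1.0]
sample_mean = hogwild_gaussian_gibbs(J, h, sweeps=20000)
```

Taking expectations of the synchronous update shows that, when the iteration is stable, its fixed point satisfies J mu = h, so the sample mean should approach the exact mean in this example. The sample covariance is not checked here and need not match the target covariance.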