Bayesian Learning
MC$^2$A: Enabling Algorithm-Hardware Co-Design for Efficient Markov Chain Monte Carlo Acceleration
Zhao, Shirui, Yin, Jun, Yao, Lingyun, Andraud, Martin, Meert, Wannes, Verhelst, Marian
An increasing number of applications are exploiting sampling-based algorithms for planning, optimization, and inference. The Markov Chain Monte Carlo (MCMC) algorithms form the computational backbone of this emerging branch of machine learning. Unfortunately, the high computational cost limits their feasibility for large-scale problems and real-world applications, and the existing MCMC acceleration solutions are either limited in hardware flexibility or fail to maintain efficiency at the system level across a variety of end-to-end applications. This paper introduces \textbf{MC$^2$A}, an algorithm-hardware co-design framework, enabling efficient and flexible optimization for MCMC acceleration. Firstly, \textbf{MC$^2$A} analyzes the MCMC workload diversity through an extension of the processor performance roofline model with a 3rd dimension to derive the optimal balance between the compute, sampling and memory parameters. Secondly, \textbf{MC$^2$A} proposes a parametrized hardware accelerator architecture with flexible and efficient support of MCMC kernels with a pipeline of ISA-programmable tree-structured processing units, reconfigurable samplers and a crossbar interconnect to support irregular access. Thirdly, the core of \textbf{MC$^2$A} is powered by a novel Gumbel sampler that eliminates exponential and normalization operations. In the end-to-end case study, \textbf{MC$^2$A} achieves an overall {$307.6\times$, $1.4\times$, $2.0\times$, $84.2\times$} speedup compared to the CPU, GPU, TPU and state-of-the-art MCMC accelerator. Evaluated on various representative MCMC workloads, this work demonstrates and exploits the feasibility of general hardware acceleration to popularize MCMC-based solutions in diverse application domains.
Data Transformation Strategies to Remove Heterogeneity
Yoo, Sangbong, Lee, Jaeyoung, Yoon, Chanyoung, Son, Geonyeong, Hong, Hyein, Seo, Seongbum, Yim, Soobin, Jung, Chanyoung, Park, Jungsoo, Kim, Misuk, Jang, Yun
Data heterogeneity is a prevalent issue, stemming from various conflicting factors, making its utilization complex. This uncertainty, particularly resulting from disparities in data formats, frequently necessitates the involvement of experts to find resolutions. Current methodologies primarily address conflicts related to data structures and schemas, often overlooking the pivotal role played by data transformation. As the utilization of artificial intelligence (AI) continues to expand, there is a growing demand for a more streamlined data preparation process, and data transformation becomes paramount. It customizes training data to enhance AI learning efficiency and adapts input formats to suit diverse AI models. Selecting an appropriate transformation technique is paramount in preserving crucial data details. Despite the widespread integration of AI across various industries, comprehensive reviews concerning contemporary data transformation approaches are scarce. This survey explores the intricacies of data heterogeneity and its underlying sources. It systematically categorizes and presents strategies to address heterogeneity stemming from differences in data formats, shedding light on the inherent challenges associated with each strategy.
Robust Explanations Through Uncertainty Decomposition: A Path to Trustworthier AI
Zhu, Chenrui, Bounia, Louenas, Nguyen, Vu Linh, Destercke, Sรฉbastien, Hoarau, Arthur
Recent advancements in machine learning have emphasized the need for transparency in model predictions, particularly as interpretability diminishes when using increasingly complex architectures. In this paper, we propose leveraging prediction uncertainty as a complementary approach to classical explainability methods. Specifically, we distinguish between aleatoric (data-related) and epistemic (model-related) uncertainty to guide the selection of appropriate explanations. Epistemic uncertainty serves as a rejection criterion for unreliable explanations and, in itself, provides insight into insufficient training (a new form of explanation). Aleatoric uncertainty informs the choice between feature-importance explanations and counterfactual explanations. This leverages a framework of explainability methods driven by uncertainty quantification and disentanglement. Our experiments demonstrate the impact of this uncertainty-aware approach on the robustness and attainability of explanations in both traditional machine learning and deep learning scenarios.
LLMs are Bayesian, in Expectation, not in Realization
Chlon, Leon, Rashidi, Sarah, Khamis, Zein, Awada, MarcAntonio M.
Large language models demonstrate remarkable in-context learning capabilities, adapting to new tasks without parameter updates. While this phenomenon has been successfully modeled as implicit Bayesian inference, recent empirical findings reveal a fundamental contradiction: transformers systematically violate the martingale property, a cornerstone requirement of Bayesian updating on exchangeable data. This violation challenges the theoretical foundations underlying uncertainty quantification in critical applications. Our theoretical analysis establishes four key results: (1) positional encodings induce martingale violations of order $ฮ(\log n / n)$; (2) transformers achieve information-theoretic optimality with excess risk $O(n^{-1/2})$ in expectation over orderings; (3) the implicit posterior representation converges to the true Bayesian posterior in the space of sufficient statistics; and (4) we derive the optimal chain-of-thought length as $k^* = ฮ(\sqrt{n}\log(1/\varepsilon))$ with explicit constants, providing a principled approach to reduce inference costs while maintaining performance. Empirical validation on GPT-3 confirms predictions (1)-(3), with transformers reaching 99\% of theoretical entropy limits within 20 examples. Our framework provides practical methods for extracting calibrated uncertainty estimates from position-aware architectures and optimizing computational efficiency in deployment.
Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update
Zhang, Yu-Jie, Xu, Sheng-An, Zhao, Peng, Sugiyama, Masashi
We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function, thereby modeling a broad class of reward distributions such as Bernoulli and Poisson. While GLBs are widely applicable to real-world scenarios, their non-linear nature introduces significant challenges in achieving both computational and statistical efficiency. Existing methods typically trade off between two objectives, either incurring high per-round costs for optimal regret guarantees or compromising statistical efficiency to enable constant-time updates. In this paper, we propose a jointly efficient algorithm that attains a nearly optimal regret bound with $\mathcal{O}(1)$ time and space complexities per round. The core of our method is a tight confidence set for the online mirror descent (OMD) estimator, which is derived through a novel analysis that leverages the notion of mix loss from online prediction. The analysis shows that our OMD estimator, even with its one-pass updates, achieves statistical efficiency comparable to maximum likelihood estimation, thereby leading to a jointly efficient optimistic method.
From Observational Data to Clinical Recommendations: A Causal Framework for Estimating Patient-level Treatment Effects and Learning Policies
Gutman, Rom, Sheiba, Shimon, Klein, Omer Noy, Bird, Naama Dekel, Gruber, Amit, Aronson, Doron, Caspi, Oren, Shalit, Uri
We propose a framework for building patient-specific treatment recommendation models, building on the large recent literature on learning patient-level causal models and inspired by the target trial paradigm of Hernan and Robins. We focus on safety and validity, including the crucial issue of causal identification when using observational data. We do not provide a specific model, but rather a way to integrate existing methods and know-how into a practical pipeline. We further provide a real world use-case of treatment optimization for patients with heart failure who develop acute kidney injury during hospitalization. The results suggest our pipeline can improve patient outcomes over the current treatment regime.
h-calibration: Rethinking Classifier Recalibration with Probabilistic Error-Bounded Objective
Huang, Wenjian, Cao, Guiping, Xia, Jiahao, Chen, Jingkun, Wang, Hao, Zhang, Jianguo
Deep neural networks have demonstrated remarkable performance across numerous learning tasks but often suffer from miscalibration, resulting in unreliable probability outputs. This has inspired many recent works on mitigating miscalibration, particularly through post-hoc recalibration methods that aim to obtain calibrated probabilities without sacrificing the classification performance of pre-trained models. In this study, we summarize and categorize previous works into three general strategies: intuitively designed methods, binning-based methods, and methods based on formulations of ideal calibration. Through theoretical and practical analysis, we highlight ten common limitations in previous approaches. To address these limitations, we propose a probabilistic learning framework for calibration called h-calibration, which theoretically constructs an equivalent learning formulation for canonical calibration with boundedness. On this basis, we design a simple yet effective post-hoc calibration algorithm. Our method not only overcomes the ten identified limitations but also achieves markedly better performance than traditional methods, as validated by extensive experiments. We further analyze, both theoretically and experimentally, the relationship and advantages of our learning objective compared to traditional proper scoring rule. In summary, our probabilistic framework derives an approximately equivalent differentiable objective for learning error-bounded calibrated probabilities, elucidating the correspondence and convergence properties of computational statistics with respect to theoretical bounds in canonical calibration. The theoretical effectiveness is verified on standard post-hoc calibration benchmarks by achieving state-of-the-art performance. This research offers valuable reference for learning reliable likelihood in related fields.
Fiducial Matching: Differentially Private Inference for Categorical Data
Romanus, Ogonnaya Michael, Boulaguiem, Younes, Molinari, Roberto
The task of statistical inference, which includes the building of confidence intervals and tests for parameters and effects of interest to a researcher, is still an open area of investigation in a differentially private (DP) setting. Indeed, in addition to the randomness due to data sampling, DP delivers another source of randomness consisting of the noise added to protect an individual's data from being disclosed to a potential attacker. As a result of this convolution of noises, in many cases it is too complicated to determine the stochastic behavior of the statistics and parameters resulting from a DP procedure. In this work, we contribute to this line of investigation by employing a simulation-based matching approach, solved through tools from the fiducial framework, which aims to replicate the data generation pipeline (including the DP step) and retrieve an approximate distribution of the estimates resulting from this pipeline. For this purpose, we focus on the analysis of categorical (nominal) data that is common in national surveys, for which sensitivity is naturally defined, and on additive privacy mechanisms. We prove the validity of the proposed approach in terms of coverage and highlight its good computational and statistical performance for different inferential tasks in simulated and applied data settings.
Fast Variational Bayes for Large Spatial Data
Recent variational Bayes methods for geospatial regression, proposed as an alternative to computationally expensive Markov chain Monte Carlo (MCMC) sampling, have leveraged Nearest Neighbor Gaussian processes (NNGP) to achieve scalability. Yet, these variational methods remain inferior in accuracy and speed compared to spNNGP, the state-of-the-art MCMC-based software for NNGP. We introduce spVarBayes, a suite of fast variational Bayesian approaches for large-scale geospatial data analysis using NNGP. Our contributions are primarily computational. We replace auto-differentiation with a combination of calculus of variations, closed-form gradient updates, and linear response corrections for improved variance estimation. We also accommodate covariates (fixed effects) in the model and offer inference on the variance parameters. Simulation experiments demonstrate that we achieve comparable accuracy to spNNGP but with reduced computational costs, and considerably outperform existing variational inference methods in terms of both accuracy and speed. Analysis of a large forest canopy height dataset illustrates the practical implementation of proposed methods and shows that the inference results are consistent with those obtained from the MCMC approach. The proposed methods are implemented in publicly available Github R-package spVarBayes.
Synthetic Tabular Data Generation: A Comparative Survey for Modern Techniques
Challagundla, Raju, Dorodchi, Mohsen, Wang, Pu, Lee, Minwoo
As privacy regulations become more stringent and access to real-world data becomes increasingly constrained, synthetic data generation has emerged as a vital solution, especially for tabular datasets, which are central to domains like finance, healthcare and the social sciences. This survey presents a comprehensive and focused review of recent advances in synthetic tabular data generation, emphasizing methods that preserve complex feature relationships, maintain statistical fidelity, and satisfy privacy requirements. A key contribution of this work is the introduction of a novel taxonomy based on practical generation objectives, including intended downstream applications, privacy guarantees, and data utility, directly informing methodological design and evaluation strategies. Therefore, this review prioritizes the actionable goals that drive synthetic data creation, including conditional generation and risk-sensitive modeling. Additionally, the survey proposes a benchmark framework to align technical innovation with real-world demands. By bridging theoretical foundations with practical deployment, this work serves as both a roadmap for future research and a guide for implementing synthetic tabular data in privacy-critical environments.