 Learning Graphical Models


Don't Pass$\mathtt{@}k$: A Bayesian Framework for Large Language Model Evaluation

arXiv.org Machine Learning

Pass$@k$ is widely used to report performance for LLM reasoning, but it often yields unstable, misleading rankings, especially when the number of trials (samples) is limited and compute is constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Code is available at https://mohsenhariri.github.io/bayes-kit
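The closed-form posterior described in this abstract can be illustrated with a minimal sketch. It assumes a single pool of trials, each scored into one of a few categories, and a Dirichlet prior over the category probabilities. The function name `rubric_posterior` and its arguments are hypothetical and are not taken from the authors' released code, which may aggregate across questions differently.

```python
import numpy as np

def rubric_posterior(counts, weights, prior=None):
    """Posterior mean and variance of a weighted rubric score w^T p.

    counts  : observed number of trials falling in each outcome category
    weights : rubric score assigned to each category
    prior   : Dirichlet concentration parameters (assumed uniform if None)
    """
    counts = np.asarray(counts, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if prior is None:
        prior = np.ones_like(counts)  # uniform Dirichlet(1, ..., 1) prior
    alpha = counts + prior            # conjugate Dirichlet posterior
    a0 = alpha.sum()
    p_mean = alpha / a0               # posterior mean of each category probability
    mean = float(weights @ p_mean)
    # Dirichlet covariance gives Var(w^T p) = (sum_i w_i^2 m_i - mean^2) / (a0 + 1)
    var = float(((weights ** 2) @ p_mean - mean ** 2) / (a0 + 1.0))
    return mean, var

# Binary special case: 7 successes out of 10 trials, categories = (fail, pass).
# With a uniform prior the posterior mean is (7 + 1) / (10 + 2).
mean, var = rubric_posterior(counts=[3, 7], weights=[0.0, 1.0])
print(f"posterior mean = {mean:.4f}, posterior std = {var ** 0.5:.4f}")
```

In the binary case with a uniform prior and a fixed number of trials N, the posterior mean (s + 1) / (N + 2) is a monotone function of the success count s, which is consistent with the order-equivalence to average accuracy stated in the abstract.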


Offline Reinforcement Learning in Large State Spaces: Algorithms and Guarantees

arXiv.org Machine Learning

This article introduces the theory of offline reinforcement learning in large state spaces, where good policies are learned from historical data without online interactions with the environment. Key concepts introduced include expressivity assumptions on function approximation (e.g., Bellman completeness vs. realizability) and data coverage (e.g., all-policy vs. single-policy coverage). A rich landscape of algorithms and results is described, depending on the assumptions one is willing to make and the sample and computational complexity guarantees one wishes to achieve. We also discuss open questions and connections to adjacent areas.


Simulation-based inference via telescoping ratio estimation for trawl processes

arXiv.org Machine Learning

The growing availability of large and complex datasets has increased interest in temporal stochastic processes that can capture stylized facts such as marginal skewness, non-Gaussian tails, long memory, and even non-Markovian dynamics. While such models are often easy to simulate from, parameter estimation remains challenging. Simulation-based inference (SBI) offers a promising way forward, but existing methods typically require large training datasets or complex architectures and frequently yield confidence (credible) regions that fail to attain their nominal coverage, raising doubts about the reliability of estimates for the very features that motivate the use of these models. To address these challenges, we propose a fast, accurate, and sample-efficient SBI framework for amortized posterior inference applicable to intractable stochastic processes. The proposed approach relies on two main steps. First, we learn the posterior density by decomposing it sequentially across parameter dimensions. Then, we use Chebyshev polynomial approximations to efficiently generate independent posterior samples, enabling accurate inference even when Markov chain Monte Carlo methods mix poorly. We further develop novel diagnostic tools for SBI in this context, as well as post-hoc calibration techniques; the latter not only improve the performance of the learned inferential tool but also allow it to be reused directly with new time series of varying lengths, thus amortizing the training cost. We demonstrate the method's effectiveness on trawl processes, a class of flexible infinitely divisible models that generalize univariate Gaussian processes, applied to energy demand data.


Neural Bayesian Filtering

arXiv.org Machine Learning

As an example, consider the problem of tracking an autonomous robot with an unknown starting position in a $d \times d$ grid (Figure 1). Suppose the agent's policy is known, and an observer sees that the agent moved a step without colliding with a wall. This information indicates how the observer should update their beliefs about the agent's position. Tracking these belief states can be challenging when they are either continuous or too large to enumerate (Solinas et al., 2023), even when the agent's policy and the environment dynamics are known. A common approach frames belief state modeling as a Bayesian filtering problem in which a posterior is maintained and updated with each new observation. Classical Bayesian filters, such as the Kalman Filter (Kalman, 1960) and its nonlinear variants (e.g., Extended and Unscented Kalman Filters (Sorenson, 1985; Julier & Uhlmann, 2004)), assume that the underlying distributions are unimodal and approximately Gaussian. While computationally efficient, this limits their applicability in settings that do not satisfy these assumptions.
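For a grid small enough to enumerate, the belief update in the robot example can be written as an exact discrete Bayes filter. The sketch below is only that enumerated baseline, not the neural method the paper proposes. It assumes, purely for illustration, that the known policy is "always step one cell to the right" and that the observation is "the step succeeded without hitting a wall"; the function name `update_belief_move_right` is hypothetical.

```python
import numpy as np

def update_belief_move_right(belief):
    """One filtering step on a d x d grid for a collision-free step to the right.

    belief : (d, d) array of probabilities over the agent's position (rows, cols)
    """
    # Correction: a collision-free step right is impossible from the last column,
    # so the likelihood of the observation is 0 there and 1 elsewhere.
    likelihood = np.ones_like(belief)
    likelihood[:, -1] = 0.0
    posterior = belief * likelihood
    total = posterior.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    posterior = posterior / total

    # Prediction: the deterministic move shifts all probability mass one column right.
    moved = np.zeros_like(posterior)
    moved[:, 1:] = posterior[:, :-1]
    return moved

d = 4
belief = np.full((d, d), 1.0 / d**2)   # unknown start: uniform prior over cells
belief = update_belief_move_right(belief)
print(belief.round(3))
```

After one such observation the first column carries no probability mass, because the agent cannot have arrived there by stepping right. The cost of this table-based update grows with the number of states, which is the limitation the passage above points to.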


Exact and Approximate MCMC for Doubly-intractable Probabilistic Graphical Models Leveraging the Underlying Independence Model

arXiv.org Machine Learning

Bayesian inference for doubly-intractable probabilistic graphical models typically involves variations of the exchange algorithm or approximate Markov chain Monte Carlo (MCMC) samplers. However, existing methods for both classes of algorithms require either perfect samplers or sequential samplers for complex models, which are often either not available, or suffer from poor mixing, especially in high dimensions. We develop a method that does not require perfect or sequential sampling, and can be applied to both classes of methods: exact and approximate MCMC. The key to our approach is to utilize the tractable independence model underlying an intractable probabilistic graphical model for the purpose of constructing a finite sample unbiased Monte Carlo (and not MCMC) estimate of the Metropolis--Hastings ratio. This innovation turns out to be crucial for scalability in high dimensions. The method is demonstrated on the Ising model. Gradient-based alternatives to construct a proposal, such as Langevin and Hamiltonian Monte Carlo approaches, also arise as a natural corollary to our general procedure, and are demonstrated as well.


Quantum Fisher information matrices from Rényi relative entropies

arXiv.org Artificial Intelligence

Quantum generalizations of the Fisher information are important in quantum information science, with applications in high energy and condensed matter physics and in quantum estimation theory, machine learning, and optimization. One can derive a quantum generalization of the Fisher information matrix in a natural way as the Hessian matrix arising in a Taylor expansion of a smooth divergence. Such an approach is appealing for quantum information theorists, given the ubiquity of divergences in quantum information theory. In contrast to the classical case, there is not a unique quantum generalization of the Fisher information matrix, similar to how there is not a unique quantum generalization of the relative entropy or the Rényi relative entropy. In this paper, I derive information matrices arising from the log-Euclidean, $α$-$z$, and geometric Rényi relative entropies, with the main technical tool for doing so being the method of divided differences for calculating matrix derivatives. Interestingly, for all non-negative values of the Rényi parameter $α$, the log-Euclidean Rényi relative entropy leads to the Kubo-Mori information matrix, and the geometric Rényi relative entropy leads to the right-logarithmic derivative Fisher information matrix. Thus, the resulting information matrices obey the data-processing inequality for all non-negative values of the Rényi parameter $α$ even though the original quantities do not. Additionally, I derive and establish basic properties of $α$-$z$ information matrices resulting from the $α$-$z$ Rényi relative entropies. For parameterized thermal states and time-evolved states, I establish formulas for their $α$-$z$ information matrices and hybrid quantum-classical algorithms for estimating them, with applications in quantum Boltzmann machine learning.
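The "Hessian of a smooth divergence" construction mentioned in this abstract can be written out in the familiar classical case. The block below is a sketch in generic notation (the symbols $p_\theta$, $\varepsilon$, $I(\theta)$, $\rho_\theta$, and $\boldsymbol{D}$ are chosen here for illustration, and the paper's normalization conventions may differ).

```latex
% Classical case: second-order expansion of the relative entropy (KL divergence)
% between nearby members of a smooth parametric family p_\theta.
D\!\left(p_{\theta} \,\middle\|\, p_{\theta+\varepsilon}\right)
  = \frac{1}{2}\, \varepsilon^{\mathsf{T}} I(\theta)\, \varepsilon
  + o\!\left(\lVert \varepsilon \rVert^{2}\right),
\qquad
[I(\theta)]_{ij}
  = \mathbb{E}_{x \sim p_{\theta}}\!\left[
      \partial_{\theta_i} \log p_{\theta}(x)\,
      \partial_{\theta_j} \log p_{\theta}(x)
    \right].

% Generic recipe: for a smooth divergence \boldsymbol{D} that vanishes and is
% minimized at \varepsilon = 0, the associated information matrix is the Hessian
[I_{\boldsymbol{D}}(\theta)]_{ij}
  = \left.
      \frac{\partial^{2}}{\partial \varepsilon_i \, \partial \varepsilon_j}\,
      \boldsymbol{D}\!\left(\rho_{\theta} \,\middle\|\, \rho_{\theta+\varepsilon}\right)
    \right|_{\varepsilon = 0}.
```

The zeroth- and first-order terms vanish because the divergence is zero and minimal at $\varepsilon = 0$, so the Hessian is the leading term. Different quantum divergences substituted for $\boldsymbol{D}$ can yield different matrices, which is the non-uniqueness the abstract describes.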


LPI-RIT at LeWiDi-2025: Improving Distributional Predictions via Metadata and Loss Reweighting with DisCo

arXiv.org Artificial Intelligence

The Learning With Disagreements (LeWiDi) 2025 shared task aims to model annotator disagreement through soft label distribution prediction and perspectivist evaluation, which focuses on modeling individual annotators. We adapt DisCo (Distribution from Context), a neural architecture that jointly models item-level and annotator-level label distributions, and present detailed analysis and improvements. In this paper, we extend DisCo by introducing annotator metadata embeddings, enhancing input representations, and multi-objective training losses to capture disagreement patterns better. Through extensive experiments, we demonstrate substantial improvements in both soft and perspectivist evaluation metrics across three datasets. We also conduct in-depth calibration and error analyses that reveal when and why disagreement-aware modeling improves. Our findings show that disagreement can be better captured by conditioning on annotator demographics and by optimizing directly for distributional metrics, yielding consistent improvements across datasets.


MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science

arXiv.org Artificial Intelligence

We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real-world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open-source LLMs. Leveraging efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym, Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively, demonstrating that MedAgentGym is an effective training ground and establishing Med-Copilot as a cost-effective, privacy-preserving alternative competitive with proprietary LLMs (gpt-4o). By offering a unified execution environment with a comprehensive benchmark and accessible, extensible training resources, MedAgentGym delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical data science.


Rethinking Probabilistic Circuit Parameter Learning

arXiv.org Artificial Intelligence

Probabilistic Circuits (PCs) offer a computationally scalable framework for generative modeling, supporting exact and efficient inference of a wide range of probabilistic queries. While recent advances have significantly improved the expressiveness and scalability of PCs, effectively training their parameters remains a challenge. In particular, a widely used optimization method, full-batch Expectation-Maximization (EM), requires processing the entire dataset before performing a single update, making it ineffective for large datasets. Although empirical extensions to the mini-batch setting, as well as gradient-based mini-batch algorithms, converge faster than full-batch EM, they generally underperform in terms of final likelihood. We investigate this gap by establishing a novel theoretical connection between these practical algorithms and the general EM objective. Our analysis reveals a fundamental issue: existing mini-batch EM and gradient-based methods fail to properly regularize distribution changes, causing each update to effectively "overfit" the current mini-batch. Motivated by this insight, we introduce anemone, a new mini-batch EM algorithm for PCs. Anemone applies an implicit adaptive learning rate to each parameter, scaled by how much it contributes to the likelihood of the current batch. Across extensive experiments on language, image, and DNA datasets, anemone consistently outperforms existing optimizers in both convergence speed and final performance.


Conformalized Generative Bayesian Imaging: An Uncertainty Quantification Framework for Computational Imaging

arXiv.org Artificial Intelligence

Uncertainty quantification plays an important role in achieving trustworthy and reliable learning-based computational imaging. Recent advances in generative modeling and Bayesian neural networks have enabled the development of uncertainty-aware image reconstruction methods. Current generative model-based methods seek to quantify the inherent (aleatoric) uncertainty on the underlying image for given measurements by learning to sample from the posterior distribution of the underlying image. On the other hand, Bayesian neural network-based approaches aim to quantify the model (epistemic) uncertainty on the parameters of a deep neural network-based reconstruction method by approximating the posterior distribution of those parameters. However, there remains a need for an inversion method that can jointly quantify complex aleatoric and epistemic uncertainty patterns. In this paper, we present a scalable framework that can quantify both aleatoric and epistemic uncertainties. The proposed framework accepts an existing generative model-based posterior sampling method as an input and introduces an epistemic uncertainty quantification capability through Bayesian neural networks with latent variables and deep ensembling. Furthermore, by leveraging the conformal prediction methodology, the proposed framework can be easily calibrated to ensure rigorous uncertainty quantification. We evaluated the proposed framework on magnetic resonance imaging, computed tomography, and image inpainting problems and showed that the epistemic and aleatoric uncertainty estimates produced by the proposed framework display the characteristic features of true epistemic and aleatoric uncertainties. Furthermore, our results demonstrated that the use of conformal prediction on top of the proposed framework enables marginal coverage guarantees consistent with frequentist principles.
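The calibration step referred to in this abstract can be illustrated with generic split conformal prediction. This is a minimal sketch, not the paper's procedure: it assumes an absolute-residual nonconformity score on a held-out calibration set, and the function name `conformal_quantile` is hypothetical.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Split-conformal threshold for miscoverage level alpha.

    scores : nonconformity scores on a held-out calibration set,
             e.g. |y_i - prediction_i| (an assumed choice of score)
    alpha  : target miscoverage, so intervals aim for coverage >= 1 - alpha
    """
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return float(scores[k - 1])

# Synthetic calibration residuals, used only to exercise the function.
rng = np.random.default_rng(0)
calibration_scores = np.abs(rng.normal(size=500))
q = conformal_quantile(calibration_scores, alpha=0.1)

# A new point prediction y_hat would then be reported as [y_hat - q, y_hat + q].
y_hat = 2.0
print(f"interval: [{y_hat - q:.3f}, {y_hat + q:.3f}]")
```

Under exchangeability of calibration and test points, an interval built this way covers the true value with probability at least 1 - alpha on average over draws, which is the marginal (not per-image or per-pixel) guarantee the abstract mentions.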