Goto

Collaborating Authors

 Learning Graphical Models




Wavefunction Flows: Efficient Quantum Simulation of Continuous Flow Models

arXiv.org Machine Learning

Flow models are a cornerstone of modern machine learning. They are generative models that progressively transform probability distributions according to learned dynamics. Specifically, they learn a continuous-time Markov process that efficiently maps samples from a simple source distribution into samples from a complex target distribution. We show that these models are naturally related to the Schrödinger equation, for an unusual Hamiltonian on continuous variables. Moreover, we prove that the dynamics generated by this Hamiltonian can be efficiently simulated on a quantum computer. Together, these results give a quantum algorithm for preparing coherent encodings (a.k.a., qsamples) for a vast family of probability distributions--namely, those expressible by flow models--by reducing the task to an existing classical learning problem, plus Hamiltonian simulation. For statistical problems defined by flow models, such as mean estimation and property testing, this enables the use of quantum algorithms tailored to qsamples, which may offer advantages over classical algorithms based only on samples from a flow model. More broadly, these results reveal a close connection between state-of-the-art machine learning models, such as flow matching and diffusion models, and one of the main expected capabilities of quantum computers: simulating quantum dynamics.


On the Optimality of Tracking Fisher Information in Adaptive Testing with Stochastic Binary Responses

arXiv.org Machine Learning

Adaptive testing and sequential estimation problems have recently gained substantial attention due to their foundational role in modern artificial intelligence and interactive systems. Prominent applications include online preference learning, where systems dynamically adapt to user feedback to refine personalized recommendations, and reinforcement learning from human feedback (RLHF), which aims to align AI agents with human values by adaptively querying users. In these contexts, the main focus is to efficiently extract maximal information from human responses, which are inherently stochastic and limited in quantity. Among various types of such problems, this work particularly considers a fundamental yet illustrative case involving stochastic binary responses. Here, a decision-maker sequentially selects questions of varying difficulty from a continuous pool to pose to a candidate and aims to efficiently estimate the candidate's ability (represented by an unknown continuous parameter) by utilizing the binary feedback (e.g., correct/incorrect) collected, which depends probabilistically on the candidate's ability and the question's difficulty. This setup is arguably the simplest scenario that captures the essence of continuous parameter estimation under uncertainty, making it an ideal benchmark for developing fundamental theoretical insights and practical algorithms. Variants of this fundamental adaptive estimation problem have been studied in several communities.


A Honest Cross-Validation Estimator for Prediction Performance

arXiv.org Machine Learning

Cross-validation is a standard tool for obtaining a honest assessment of the performance of a prediction model. The commonly used version repeatedly splits data, trains the prediction model on the training set, evaluates the model performance on the test set, and averages the model performance across different data splits. A well-known criticism is that such cross-validation procedure does not directly estimate the performance of the particular model recommended for future use. In this paper, we propose a new method to estimate the performance of a model trained on a specific (random) training set. A naive estimator can be obtained by applying the model to a disjoint testing set. Surprisingly, cross-validation estimators computed from other random splits can be used to improve this naive estimator within a random-effects model framework. We develop two estimators -- a hierarchical Bayesian estimator and an empirical Bayes estimator -- that perform similarly to or better than both the conventional cross-validation estimator and the naive single-split estimator. Simulations and a real-data example demonstrate the superior performance of the proposed method.


From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation

arXiv.org Machine Learning

Generative models form the backbone of modern machine learning, underpinning state-of-the-art systems in text, vision, and multimodal applications. While Maximum Likelihood Estimation has traditionally served as the dominant training paradigm, recent work have highlighted its limitations, particularly in generalization and susceptibility to catastrophic forgetting compared to Reinforcement Learning techniques, such as Policy Gradient methods. However, these approaches depend on explicit reward signals, which are often unavailable in practice, leaving open the fundamental problem of how to align generative models when only high-quality datasets are accessible. In this work, we address this challenge via a Bilevel Optimization framework, where the reward function is treated as the optimization variable of an outer-level problem, while a policy gradient objective defines the inner-level. We then conduct a theoretical analysis of this optimization problem in a tractable setting and extract insights that, as we demonstrate, generalize to applications such as tabular classification and model-based reinforcement learning. We release the code at https://github.com/abenechehab/nll_to_po .


WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning

arXiv.org Artificial Intelligence

While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, entirely guided by binary rewards depending on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and Llama-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.


DexNDM: Closing the Reality Gap for Dexterous In-Hand Rotation via Joint-Wise Neural Dynamics Model

arXiv.org Artificial Intelligence

Achieving generalized in-hand object rotation remains a significant challenge in robotics, largely due to the difficulty of transferring policies from simulation to the real world. The complex, contact-rich dynamics of dexterous manipulation create a "reality gap" that has limited prior work to constrained scenarios involving simple geometries, limited object sizes and aspect ratios, constrained wrist poses, or customized hands. We address this sim-to-real challenge with a novel framework that enables a single policy, trained in simulation, to generalize to a wide variety of objects and conditions in the real world. The core of our method is a joint-wise dynamics model that learns to bridge the reality gap by effectively fitting limited amount of real-world collected data and then adapting the sim policy's actions accordingly. The model is highly data-efficient and generalizable across different whole-hand interaction distributions by factorizing dynamics across joints, compressing system-wide influences into low-dimensional variables, and learning each joint's evolution from its own dynamic profile, implicitly capturing these net effects. We pair this with a fully autonomous data collection strategy that gathers diverse, real-world interaction data with minimal human intervention. Our complete pipeline demonstrates unprecedented generality: a single policy successfully rotates challenging objects with complex shapes (e.g., animals), high aspect ratios (up to 5.33), and small sizes, all while handling diverse wrist orientations and rotation axes. Comprehensive real-world evaluations and a teleoperation application for complex tasks validate the effectiveness and robustness of our approach. Website: https://meowuu7.github.io/DexNDM/


If Probable, Then Acceptable? Understanding Conditional Acceptability Judgments in Large Language Models

arXiv.org Artificial Intelligence

Conditional acceptability refers to how plausible a conditional statement is perceived to be. It plays an important role in communication and reasoning, as it influences how individuals interpret implications, assess arguments, and make decisions based on hypothetical scenarios. When humans evaluate how acceptable a conditional "If A, then B" is, their judgments are influenced by two main factors: the $\textit{conditional probability}$ of $B$ given $A$, and the $\textit{semantic relevance}$ of the antecedent $A$ given the consequent $B$ (i.e., whether $A$ meaningfully supports $B$). While prior work has examined how large language models (LLMs) draw inferences about conditional statements, it remains unclear how these models judge the $\textit{acceptability}$ of such statements. To address this gap, we present a comprehensive study of LLMs' conditional acceptability judgments across different model families, sizes, and prompting strategies. Using linear mixed-effects models and ANOVA tests, we find that models are sensitive to both conditional probability and semantic relevance-though to varying degrees depending on architecture and prompting style. A comparison with human data reveals that while LLMs incorporate probabilistic and semantic cues, they do so less consistently than humans. Notably, larger models do not necessarily align more closely with human judgments.


Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization

arXiv.org Artificial Intelligence

Direct Preference Optimization (DPO) has recently emerged as a simple and effective alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with user preferences. However, existing DPO formulations rely on a single monolithic model, which limits their expressivity in multi-task settings and their adaptability to heterogeneous or diverse preference distributions. In this work, we propose Mix- and MoE-DPO, a framework that extends DPO with both soft mixture models and mixture-of-experts (MoE) architectures, using a stochastic variational inference approach. Our method introduces a latent-variable model over expert assignments and optimizes a variational evidence lower bound (ELBO), enabling stable and efficient learning of specialized expert policies from preference data. Mix- and MoE-DPO provides three key advantages over standard DPO: (i) generalization via universal function approximation through mixtures; (ii) reward and policy specialization through expert components tailored to distinct preference modes; and (iii) contextual alignment through input-dependent soft gating that enables user-specific mixture policies. Our framework supports both shared base architectures with expert-specific policy heads and fully independent expert models, allowing flexible trade-offs between parameter efficiency and specialization. We validate our approach on a variety of model sizes and multi-preference datasets, demonstrating that Mix- and MoE-DPO offers a powerful and scalable method for preference-based LLM alignment.