AITopics | Reinforcement Learning

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

Adaptive state-action abstractions via rate-distortion

Rosas, Fernando E.

arXiv.org Machine Learning, Jun-5-2026

When learning to walk, infants seem to address a coarse version of the problem first - stay upright, reach the caregiver - and refine it only when further practice at that resolution stops paying off. Reinforcement learning offers multiple techniques for building simple versions of complex tasks, but lacks general principles for how to dynamically adjust the granularity of these abstractions during learning. This paper proposes one such principle: refine the abstraction as soon as the learning error within it becomes comparable to the error induced by the abstraction itself. Here, we investigate one way of formalising this principle via a performance certificate that decomposes value error into two terms: a learning error bound captured by a Bellman residual, and an abstraction error bound given by a bisimulation metric. The resulting switching strategy is implemented by soft state-action abstractions built from rate-distortion principles, whose resolution along state and action axes can be continuously adjusted. We validate this construction in a range of tabular settings, showing that near-optimal performance can be achieved under substantial lossy compression of state and action information.

abstraction, machine learning, reinforcement learning, (15 more...)

arXiv.org Machine Learning

2606.06123

Country: North America > United States (0.93)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)

Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

Dann, Christoph, Mansour, Yishay, Mohri, Mehryar

arXiv.org Machine Learning, May-29-2026

Model-based reinforcement learning (MBRL) agents typically learn world models by minimizing predictive loss. However, powerful RL optimizers inevitably exploit minor model inaccuracies, leading to simulator exploitation and a reality gap where policies succeed in simulation but fail in the real world. We propose that the objective for learning simulators should be strategic robustness rather than predictive accuracy, and formulate this as a zero-sum minimax game between a model player and an adversarial policy player. We provide a comprehensive theoretical analysis: (1) an online learning guarantee showing the game is learnable with sublinear regret bounds; (2) a tractable critic-based simplification bounding the global policy-value gap by the local critic's loss; and (3) an Error-MDP duality, proving that finding the worst-case policy is formally dual to a standard RL problem where the reward is the one-step critic error. This duality yields a provably convergent active data selection algorithm. Experiments on continuous control tasks demonstrate that our approach reduces prediction error in strategically important regions by $1.5$-$2.2\times$ and enables policies trained purely in simulation to match near-optimal real-world performance.

machine learning, reinforcement learning, simulator, (13 more...)

arXiv.org Machine Learning

2605.29032

Genre: Research Report (0.81)

Industry:

Education (0.48)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Accelerating Reinforcement Learning Training Using Simulation Surrogate Models

Ghasemloo, Mohammadmahdi, Eckman, David J., Li, Yaxian

arXiv.org Machine Learning, May-28-2026

High-fidelity simulation models are widely used to analyze complex stochastic systems, but their high computational cost motivates the development of cheaper surrogate models that approximate the simulation model's input-output relationship. In parallel, reinforcement learning (RL) has emerged as a powerful framework for making online decisions in stochastic environments, with increasing attention being given to the use of simulation models as training environments for RL models. We investigate a class of surrogate models suitable for accelerating RL training in settings where the reward structure, model parameters, or system dynamics change over time and explore their interactions with simulation models and RL models. Through numerical experiments on a stochastic service system modeled via discrete-event simulation, we demonstrate that leveraging surrogate models can substantially accelerate RL training and re-training.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Machine Learning

2605.27556

Country:

North America > United States > Texas (0.14)
North America > United States > New York (0.14)
North America > United States > New Jersey (0.14)
North America > Canada > Quebec (0.14)

Genre: Research Report (0.50)

Industry: Education > Educational Technology > Educational Software > Computer Based Training (0.35)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Reward Transfer from Inverse Reinforcement Learning: A Coupled Minimax Approach

Hao, Guang-Yuan, van der Laan, Lars, Bibaut, Aurélien, Kallus, Nathan

arXiv.org Machine Learning, May-28-2026

Expert demonstrations, such as those from car drivers, help navigate environments with unknown rewards, but are often collected in controlled settings, such as closed-course test tracks, while learned control policies must be deployed in new environments, such as city streets. We can imitate experts to perform well in the same source environment where demonstrations are observed, and we may even use inverse reinforcement learning (IRL) to improve on simple behavior cloning (Ng and Russell, 2000; Abbeel and Ng, 2004; Ziebart et al., 2008; Fu et al., 2018; Geng et al., 2020). But the target environment may have a different transition law, discount factor, or soft-control regularization. For this, IRL is crucial: we can learn a reward from demonstrations in the source environment and transfer it to the target environment, learning a policy that optimizes the same reward function in a new setting (Fu et al., 2018; Schlaginhaufen and Kamgarpour, 2024). In this paper, we characterize how well this transfer can be done and which approaches are preferable. In particular, we show the value in a coupled approach that takes the target environment into account even when learning from the source. In ordinary offline control, the Bellman equation uses a known reward, so the main statistical error comes from target transitions.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Machine Learning

2605.27834

Genre: Research Report (0.63)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Variance-Adaptive Optimal Algorithm for Reinforcement Learning with Multinomial Logit Function Approximation

Kim, Wonyoung, Oh, Min-Hwan, Iyengar, Garud, Zeevi, Assaf

arXiv.org Machine Learning, May-28-2026

Reinforcement learning with multinomial logistic (MNL) function approximation has become an important framework due to its flexibility and broad applicability. While existing studies have established regret guarantees under worst-case analysis, they do not capture how performance depends on the variability of the interaction between the learner and the environment. In this paper, we develop a new theoretical analysis for MNL-based Markov decision processes that yields explicit variance-adaptive regret bounds. Our algorithm is computationally efficient and achieves the instance-wise optimal rate of regret, narrowing the gap between upper and lower bounds. Our numerical experiments validate that our method learns optimal policies more efficiently than conventional approaches.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

2605.28364

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.84)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.60)

Fast Convergence of Policy Regret in Learning Stochastic Optimal Control

Wang, Shengbo, Blanchet, Jose, Glynn, Peter

arXiv.org Machine Learning, May-27-2026

Policy learning in modern operations environments faces a fundamental tension between limited operational data and the large, often continuous, state and action spaces over which good decisions must be identified and deployed. We study value-based policy learning in stochastic optimal control: a greedy policy induced by an estimate of the optimal action-value function $Q^*$ is deployed, and its performance is measured by regret. The empirical success of this approach calls for statistical insight into the structures that enable fast regret convergence. We show that, in continuous action spaces, fast policy learning is induced by three geometric structures: a growth exponent $p$, which quantifies how quickly $Q^*$ separates suboptimal actions from its maximizers; a margin-mass exponent $m$, which controls how much deployment mass lies on states with weak growth; and an action-wise regularity exponent $q$, which measures the smoothness of the $Q^*$-estimation error across actions. Given an $n^{-1/2}$-accurate estimator of $Q^*$, we show that the minimax-optimal policy regret convergence rate is \[ \widetilde{\Theta}\left( n^{-\min\left\{\frac{p}{2(p-q)},\frac{m+1}{2m}\right\}} \right), \] up to a logarithmic factor at the boundary between the two regimes. The exponent $q$ is crucial: $q>0$ yields faster-than-$n^{-1/2}$ regret. This regime is natural in operations applications. In particular, we verify $q>0$ under mild regularity conditions in dynamic inventory control and service allocation examples, while the mechanism underlying this fast rate regime extends beyond these settings.
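The exponent in the rate above can be evaluated directly to see how $q$ changes it. The following is a minimal sketch, not code from the paper: the function name `regret_exponent` and the sample values of $p$, $q$, $m$ are illustrative assumptions, and the conditions $0 \le q < p$ and $m > 0$ are assumed here only so that both terms are well defined.

```python
def regret_exponent(p: float, q: float, m: float) -> float:
    """Exponent r such that the stated regret rate is n^(-r), ignoring log factors.

    Evaluates min{ p / (2 (p - q)), (m + 1) / (2 m) } from the abstract.
    Assumption (not from the source): 0 <= q < p and m > 0.
    """
    if not (0 <= q < p) or m <= 0:
        raise ValueError("expected 0 <= q < p and m > 0")
    growth_term = p / (2 * (p - q))    # driven by growth and regularity exponents
    margin_term = (m + 1) / (2 * m)    # driven by the margin-mass exponent
    return min(growth_term, margin_term)


# Illustrative values only; they are not taken from the paper.
baseline = regret_exponent(p=2.0, q=0.0, m=1.0)  # q = 0
faster = regret_exponent(p=2.0, q=1.0, m=1.0)    # q > 0
```

With $q = 0$ the first term equals $1/2$ while the second equals $1/2 + 1/(2m) > 1/2$, so the exponent is exactly $1/2$. Any $q > 0$ lifts the first term above $1/2$, which matches the abstract's statement that $q > 0$ yields faster-than-$n^{-1/2}$ regret.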

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

2605.26361

Country: North America > United States (0.67)

Genre: Research Report (0.81)

Industry: Education (0.45)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)

Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking

Kiyohara, Haruka, Curmei, Mihaela, Evnine, Ariel, Kalyanaraman, Shankar, Nir, Israel, Pop, Ana-Roxana, Razin, Nitzan, Dean, Sarah, Joachims, Thorsten, Weinsberg, Udi

arXiv.org Machine Learning, May-27-2026

Large-scale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-stage ranker (ESR) generates a candidate set, which is subsequently re-ranked by a late-stage ranker (LSR). While there are many reinforcement learning (RL) methods for training the LSR, end-to-end training of the ESR has proven challenging. In particular, naive application of "vanilla" policy gradient (V-PG) is not scalable for candidate-set sizes relevant for practical use due to exploding variance. This issue arises because V-PG propagates the gradient to the joint probability of the candidate sets, ignoring the contribution of each specific item in the candidate set to the reward. To mitigate this issue, we propose a novel "credit-assigned" policy gradient (CA-PG), which computes gradients with respect to the probability that the target item is chosen in any candidate set, i.e. marginalizing over all candidate sets that contain it. Our theoretical analysis reveals that CA-PG significantly reduces the variance of V-PG by marginalizing over the specific composition of the candidate set, while preserving the ability to learn the correct ranking of items under a reasonably aligned LSR policy. Experiments on both synthetic and real-world data demonstrate that CA-PG improves the convergence speed and training stability for ESRs utilizing the canonical Plackett-Luce model, especially when the candidate-set size is large.
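The quantity that CA-PG is described as differentiating, the probability that a target item appears in any candidate set, can be illustrated on a toy Plackett-Luce model. The sketch below is not the paper's algorithm: it assumes the candidate set is an ordered top-$k$ list drawn without replacement, computes the marginal by brute-force enumeration (feasible only for tiny item counts), and uses hypothetical names (`pl_ordered_prob`, `inclusion_prob`) and made-up scores.

```python
import itertools
import math


def pl_ordered_prob(scores, ranking):
    """Plackett-Luce probability of drawing `ranking` (ordered item indices) as the top-k list."""
    remaining = set(range(len(scores)))
    prob = 1.0
    for item in ranking:
        denom = sum(math.exp(scores[j]) for j in remaining)
        prob *= math.exp(scores[item]) / denom
        remaining.remove(item)  # sampling without replacement
    return prob


def inclusion_prob(scores, item, k):
    """Probability that `item` appears anywhere in the size-k candidate set.

    Marginalises over every ordered candidate list that contains the item.
    Brute force: cost grows like n! / (n - k)!, so this is for illustration only.
    """
    n = len(scores)
    return sum(
        pl_ordered_prob(scores, ranking)
        for ranking in itertools.permutations(range(n), k)
        if item in ranking
    )


scores = [1.0, 0.5, 0.0, -0.5]  # hypothetical item scores
k = 2
marginals = [inclusion_prob(scores, i, k) for i in range(len(scores))]
```

Each candidate list contains exactly $k$ items, so the marginals sum to $k$. The joint probability of one specific list, which the abstract says V-PG propagates gradients through, is a single term of the sum inside `inclusion_prob`.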

large language model, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

2605.26385

Country: Asia (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.66)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Efficient Preference Poisoning Attack on Offline RLHF

Yang, Chenye, Xu, Weiyu, Lai, Lifeng

arXiv.org Machine Learning, May-26-2026

Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference poisoning attacks. We study label flip attacks against log-linear DPO. We first illustrate that flipping one preference label induces a parameter-independent shift in the DPO gradient. Using this key property, we can then convert the targeted poisoning problem into a structured binary sparse approximation problem. To solve this problem, we develop two attack methods: Binary-Aware Lattice Attack (BAL-A) and Binary Matching Pursuit Attack (BMP-A). BAL-A embeds the binary flip selection problem into a binary-aware lattice and applies Lenstra-Lenstra-Lovász reduction and Babai's nearest plane algorithm; we provide sufficient conditions that enforce binary coefficients and recover the minimum-flip objective. BMP-A adapts binary matching pursuit to our non-normalized gradient dictionary and yields coherence-based recovery guarantees and robustness (impossibility) certificates for $K$-flip budgets. Experiments on synthetic dictionaries and the Stanford Human Preferences dataset validate the theory and highlight how dictionary geometry governs attack success.
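The parameter-independent shift can be checked numerically in a small setting. The sketch below is an illustration under stated assumptions, not the paper's code: it assumes a log-linear policy $\pi_\theta(y \mid x) \propto \exp(\theta^\top \phi(x, y))$, the standard per-pair DPO loss $-\log \sigma(z)$ with temperature $\beta$, and made-up feature vectors. Under those assumptions, flipping a label negates the margin $z$, and since $\sigma(z) + \sigma(-z) = 1$ the gradient moves by exactly $\beta(\phi_w - \phi_l)$ whatever $\theta$ is.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dpo_grad(theta, phi_w, phi_l, ref_margin, beta):
    """Gradient w.r.t. theta of -log sigmoid(z) for one preference pair.

    z = beta * (theta . (phi_w - phi_l) - ref_margin), where ref_margin stands for
    log pi_ref(y_w|x) - log pi_ref(y_l|x). The log-linear normaliser cancels in the
    difference of log-probabilities, so it never appears.
    """
    delta = [w - l for w, l in zip(phi_w, phi_l)]
    z = beta * (sum(t * d for t, d in zip(theta, delta)) - ref_margin)
    coeff = -beta * sigmoid(-z)
    return [coeff * d for d in delta]


def flip_shift(theta, phi_w, phi_l, ref_margin, beta):
    """Gradient after flipping the preference label minus the gradient before."""
    original = dpo_grad(theta, phi_w, phi_l, ref_margin, beta)
    # Flipping swaps winner and loser, which also negates the reference margin.
    flipped = dpo_grad(theta, phi_l, phi_w, -ref_margin, beta)
    return [f - o for f, o in zip(flipped, original)]


# Hypothetical features and parameters, chosen only for illustration.
phi_w, phi_l, beta = [1.0, 0.0, 2.0], [0.5, 1.0, -1.0], 0.1
shift_a = flip_shift([0.3, -0.2, 0.7], phi_w, phi_l, 0.4, beta)
shift_b = flip_shift([-5.0, 4.0, 1.5], phi_w, phi_l, 0.4, beta)
expected = [beta * (w - l) for w, l in zip(phi_w, phi_l)]
```

Both `shift_a` and `shift_b` agree with `expected` up to floating-point error even though the two parameter vectors differ, which is the property the abstract says allows the attack to be recast as a sparse approximation problem over a fixed dictionary of gradient shifts.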

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Machine Learning

2605.02495

Country:

Europe (1.00)
Asia (1.00)
North America > United States > California (0.92)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Learning Kernel-Based MDPs from Episodic Preferential Feedback

Pavlovic, Nikola, Vakili, Sattar, Zhao, Qing

arXiv.org Machine Learning, May-26-2026

Human feedback often arrives as preferences rather than calibrated numeric rewards, motivating reinforcement learning from preferential feedback, also referred to as reinforcement learning from human feedback (RLHF). We present a rigorous theoretical study of preference-only learning in episodic kernel MDPs. In each episode, the learner deploys two policies from a common start state and receives a single binary label indicating which trajectory is preferred, modeled by a Bradley--Terry--Luce link on the difference of cumulative (unobserved) rewards. Under kernel-based assumptions on the reward and transition functions (one of the most general models amenable to theoretical analysis) we develop preference-based value estimation and confidence sets tailored to end-of-episode comparisons. We prove high-probability regret bounds that scale sublinearly in the number of episodes, implying that the value of the learned policy converges to that of the optimal policy.
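The feedback model described above can be written down in a few lines. This is a minimal sketch assuming the usual logistic form of the Bradley-Terry-Luce link; the function name `btl_preference_prob` and the reward sequences are hypothetical. In the setting of the abstract the rewards are unobserved, and the learner sees only a binary label drawn with this probability.

```python
import math


def btl_preference_prob(rewards_a, rewards_b):
    """P(trajectory A is preferred over B) under a logistic Bradley-Terry-Luce link
    applied to the difference of cumulative rewards."""
    diff = sum(rewards_a) - sum(rewards_b)
    return 1.0 / (1.0 + math.exp(-diff))


# Hypothetical per-step rewards for two trajectories from a common start state.
p_ab = btl_preference_prob([1.0, 0.0, 0.5], [0.2, 0.3, 0.0])
p_ba = btl_preference_prob([0.2, 0.3, 0.0], [1.0, 0.0, 0.5])
p_tie = btl_preference_prob([1.0, 1.0], [2.0, 0.0])  # equal cumulative reward
```

Only the difference of the cumulative rewards enters the link, so trajectories with equal totals are preferred with probability one half regardless of how the reward is spread across steps.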

artificial intelligence, machine learning, reinforcement learning, (20 more...)

arXiv.org Machine Learning

2605.2365

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Causality as the Statistical Conscience of Artificial Intelligence: From Pearl's Ladder to Trustworthy Machines

Fokoué, Ernest

arXiv.org Machine Learning, May-26-2026

Modern Artificial Intelligence achieves remarkable predictive power by optimizing statistical risk functionals over vast corpora. Yet a gap separates this from genuine intelligence: the inability to distinguish correlation from causation. This paper argues that causal inference (identifying mechanisms invariant under intervention) is AI's indispensable statistical conscience. Without causal grounding, AI systems are correlation machines: powerful in familiar domains, brittle under distribution shift, and biased in high-stakes settings. Three contributions develop this argument. First, a Statistical Necessity Theorem for Causal Generalization: any algorithm achieving out-of-distribution generalization must encode causal structure, formalizing the distinction between prediction P(Y|X) and intelligence P(Y|do(X)). Second, a unified framework connects Pearl's do-calculus, the Potential Outcomes framework, Double Machine Learning, and Invariant Risk Minimization as a family of Causal Statistical Estimators, each identifying interventional distributions under different assumptions. Third, three AI failure modes (hallucination in large language models, reward hacking in reinforcement learning from human feedback, and degradation under distribution shift) are manifestations of causal blindness, each admitting a principled statistical remedy. Trustworthy AI is, at its core, a problem of causal statistics. The statistical community is not merely equipped to solve it -- it is the only community with the foundational tools to do so rigorously.

large language model, machine learning, reinforcement learning, (21 more...)

arXiv.org Machine Learning

2605.24076

Country: North America > United States (0.28)

Genre: Research Report > Experimental Study (0.46)

Industry: Health & Medicine > Diagnostic Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.34)
