AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

A Meta-MDP Approach to Exploration for Lifelong Reinforcement Learning

Francisco Garcia, Philip S. Thomas

Neural Information Processing SystemsAug-20-2025, 01:17:56 GMT

Neural Information Processing Systems http://nips.cc/

agent, exploration, exploration policy, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Washington > King County > Bellevue (0.04)
(6 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

Constrained Reinforcement Learning Has Zero Duality Gap

Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, Alejandro Ribeiro

Neural Information Processing SystemsAug-20-2025, 01:17:21 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, dual problem, parametrization, (14 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Pennsylvania (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Industry: Energy (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.85)

Add feedback

Regularized Anderson Acceleration for Off-Policy Deep Reinforcement Learning

Wenjie Shi, Shiji Song, Hui Wu, Ya-Chu Hsu, Cheng Wu, Gao Huang

Neural Information Processing SystemsAug-20-2025, 00:30:36 GMT

Neural Information Processing Systems http://nips.cc/

anderson acceleration, iteration, policy iteration, (13 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
(2 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Beyond Confidence Regions: Tight Bayesian Ambiguity Sets for Robust MDPs

Marek Petrik, Reazul Hasan Russel

Neural Information Processing SystemsAug-20-2025, 00:18:46 GMT

Robust MDPs (RMDPs) can be used to compute policies with provable worst-case guarantees in reinforcement learning.

ambiguity, transition probability, value function, (11 more...)

Neural Information Processing Systems

Country:

North America > United States > New Hampshire (0.04)
North America > Canada (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Technology:

Information Technology > Data Science (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.47)

Add feedback

Machine Teaching of Active Sequential Learners

Tomi Peltola, Mustafa Mert Çelikok, Pedram Daee, Samuel Kaski

Neural Information Processing SystemsAug-20-2025, 00:18:11 GMT

Neural Information Processing Systems http://nips.cc/

learner, probability, teaching, (15 more...)

Neural Information Processing Systems

Country:

Europe > Finland > Uusimaa > Helsinki (0.04)
North America > United States (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.68)

Industry: Education (1.00)

Technology:

Information Technology > Human Computer Interaction (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
(2 more...)

Add feedback

b5b03f06271f8917685d14cea7c6c50a-Paper.pdf

Neural Information Processing SystemsAug-20-2025, 00:09:02 GMT

evaluation, reinforcement learning, softopc, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.98)
Information Technology > Artificial Intelligence > Robots (0.96)

Add feedback

Minimizing the Weighted Number of Tardy Jobs: Data-Driven Heuristic for Single-Machine Scheduling

Antonov, Nikolai, Šůcha, Prěmysl, Janota, Mikoláš, Hůla, Jan

arXiv.org Machine LearningAug-20-2025

Existing research on single-machine scheduling is largely focused on exact algorithms, which perform well on typical instances but can significantly deteriorate on certain regions of the problem space. In contrast, data-driven approaches provide strong and scalable performance when tailored to the structure of specific datasets. Leveraging this idea, we focus on a single-machine scheduling problem where each job is defined by its weight, duration, due date, and deadline, aiming to minimize the total weight of tardy jobs. We introduce a novel data-driven scheduling heuristic that combines machine learning with problem-specific characteristics, ensuring feasible solutions, which is a common challenge for ML-based algorithms. Experimental results demonstrate that our approach significantly outperforms the state-of-the-art in terms of optimality gap, number of optimal solutions, and adaptability across varied data scenarios, highlighting its flexibility for practical applications. In addition, we conduct a systematic exploration of ML models, addressing a common gap in similar studies by offering a detailed model selection process and providing insights into why the chosen model is the best fit.

artificial intelligence, machine learning, reinforcement learning, (21 more...)

arXiv.org Machine Learning

2508.13703

Country:

Europe > Czechia > Prague (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Health & Medicine (1.00)
Transportation (0.68)
Energy > Renewable (0.46)
Energy > Power Industry (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Add feedback

Learning from Preferences and Mixed Demonstrations in General Settings

Brown, Jason R, Ek, Carl Henrik, Mullins, Robert D

arXiv.org Artificial IntelligenceAug-20-2025

Reinforcement learning is a general method for learning in sequential settings, but it can often be difficult to specify a good reward function when the task is complex. In these cases, preference feedback or expert demonstrations can be used instead. However, existing approaches utilising both together are often ad-hoc, rely on domain-specific properties, or won't scale. We develop a new framing for learning from human data, \emph{reward-rational partial orderings over observations}, designed to be flexible and scalable. Based on this we introduce a practical algorithm, LEOPARD: Learning Estimated Objectives from Preferences And Ranked Demonstrations. LEOPARD can learn from a broad range of data, including negative demonstrations, to efficiently learn reward functions across a wide range of domains. We find that when a limited amount of preference and demonstration feedback is available, LEOPARD outperforms existing baselines by a significant margin. Furthermore, we use LEOPARD to investigate learning from many types of feedback compared to just a single one, and find that combining feedback types is often beneficial.

demonstration, machine learning, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

2508.14027

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Robots (0.94)
Information Technology > Artificial Intelligence > Natural Language (0.94)

Add feedback

Categorical Policies: Multimodal Policy Learning and Exploration in Continuous Control

Islam, SM Mazharul, Huber, Manfred

arXiv.org Artificial IntelligenceAug-20-2025

A policy in deep reinforcement learning (RL), either deterministic or stochastic, is commonly parameterized as a Gaussian distribution alone, limiting the learned behavior to be unimodal. However, the nature of many practical decision-making problems favors a multimodal policy that facilitates robust exploration of the environment and thus to address learning challenges arising from sparse rewards, complex dynamics, or the need for strategic adaptation to varying contexts. This issue is exacerbated in continuous control domains where exploration usually takes place in the vicinity of the predicted optimal action, either through an additive Gaussian noise or the sampling process of a stochastic policy. In this paper, we introduce Categorical Policies to model multimodal behavior modes with an intermediate categorical distribution, and then generate output action that is conditioned on the sampled mode. We explore two sampling schemes that ensure differentiable discrete latent structure while maintaining efficient gradient-based optimization. By utilizing a latent categorical distribution to select the behavior mode, our approach naturally expresses multimodality while remaining fully differentiable via the sampling tricks. We evaluate our multimodal policy on a set of DeepMind Control Suite environments, demonstrating that through better exploration, our learned policies converge faster and outperform standard Gaussian policies. Our results indicate that the Categorical distribution serves as a powerful tool for structured exploration and multimodal behavior representation in continuous control.

arxiv preprint arxiv, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

2508.13922

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Toward Deployable Multi-Robot Collaboration via a Symbolically-Guided Decision Transformer

Rasanji, Rathnam Vidushika, Wei-Kocsis, Jin, Zhang, Jiansong, Gan, Dongming, Athinarayanan, Ragu, Asunda, Paul

arXiv.org Artificial IntelligenceAug-20-2025

Reinforcement learning (RL) has demonstrated great potential in robotic operations. However, its data-intensive nature and reliance on the Markov Decision Process (MDP) assumption limit its practical deployment in real-world scenarios involving complex dynamics and long-term temporal dependencies, such as multi-robot manipulation. Decision Transformers (DTs) have emerged as a promising offline alternative by leveraging causal transformers for sequence modeling in RL tasks. However, their applications to multi-robot manipulations still remain underexplored. To address this gap, we propose a novel framework, Symbolically-Guided Decision Transformer (SGDT), which integrates a neuro-symbolic mechanism with a causal transformer to enable deployable multi-robot collaboration. In the proposed SGDT framework, a neuro-symbolic planner generates a high-level task-oriented plan composed of symbolic subgoals. Guided by these subgoals, a goal-conditioned decision transformer (GCDT) performs low-level sequential decision-making for multi-robot manipulation. This hierarchical architecture enables structured, interpretable, and generalizable decision making in complex multi-robot collaboration tasks. We evaluate the performance of SGDT across a range of task scenarios, including zero-shot and few-shot scenarios. To our knowledge, this is the first work to explore DT-based technology for multi-robot manipulation.

large language model, machine learning, reinforcement learning, (21 more...)

arXiv.org Artificial Intelligence

2508.13877

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.49)
Information Technology > Artificial Intelligence > Robots > Robot Planning & Action (0.48)

Add feedback