AITopics | supervised reinforcement learning

Collaborating Authors

supervised reinforcement learning

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Deng, Yihe, Hsu, I-Hung, Yan, Jun, Wang, Zifeng, Han, Rujun, Zhang, Gufeng, Chen, Yanfei, Wang, Wei, Pfister, Tomas, Lee, Chen-Yu

arXiv.org Artificial IntelligenceOct-31-2025

Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.

large language model, machine learning, reinforcement learning, (19 more...)

arXiv.org Artificial Intelligence

2510.25992

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Learning Interpretable Policies in Hindsight-Observable POMDPs through Partially Supervised Reinforcement Learning

Lanier, Michael, Xu, Ying, Jacobs, Nathan, Zhang, Chongjie, Vorobeychik, Yevgeniy

arXiv.org Artificial IntelligenceFeb-14-2024

Deep reinforcement learning has demonstrated remarkable achievements across diverse domains such as video games, robotic control, autonomous driving, and drug discovery. Common methodologies in partially-observable domains largely lean on end-to-end learning from high-dimensional observations, such as images, without explicitly reasoning about true state. We suggest an alternative direction, introducing the Partially Supervised Reinforcement Learning (PSRL) framework. At the heart of PSRL is the fusion of both supervised and unsupervised learning. The approach leverages a state estimator to distill supervised semantic state information from high-dimensional observations which are often fully observable at training time. This yields more interpretable policies that compose state predictions with control. In parallel, it captures an unsupervised latent representation. These two-the semantic state and the latent state-are then fused and utilized as inputs to a policy network. This juxtaposition offers practitioners a flexible and dynamic spectrum: from emphasizing supervised state information to integrating richer, latent insights. Extensive experimental results indicate that by merging these dual representations, PSRL offers a potent balance, enhancing model interpretability while preserving, and often significantly outperforming, the performance benchmarks set by traditional methods in terms of reward and convergence speed.

information, learning, reinforcement, (14 more...)

arXiv.org Artificial Intelligence

2402.0929

Country:

North America > United States (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.68)

Industry:

Transportation (0.48)
Leisure & Entertainment > Games > Computer Games (0.48)
Information Technology > Robotics & Automation (0.34)
Automobiles & Trucks (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.85)

Add feedback

Supervised Reinforcement Learning with Recurrent Neural Network for Dynamic Treatment Recommendation

Wang, Lu, Zhang, Wei, He, Xiaofeng, Zha, Hongyuan

arXiv.org Machine LearningJul-4-2018

Dynamic treatment recommendation systems based on large-scale electronic health records (EHRs) become a key to successfully improve practical clinical outcomes. Prior relevant studies recommend treatments either use supervised learning (e.g. matching the indicator signal which denotes doctor prescriptions), or reinforcement learning (e.g. maximizing evaluation signal which indicates cumulative reward from survival rates). However, none of these studies have considered to combine the benefits of supervised learning and reinforcement learning. In this paper, we propose Supervised Reinforcement Learning with Recurrent Neural Network (SRL-RNN), which fuses them into a synergistic learning framework. Specifically, SRL-RNN applies an off-policy actor-critic framework to handle complex relations among multiple medications, diseases and individual characteristics. The "actor" in the framework is adjusted by both the indicator signal and evaluation signal to ensure effective prescription and low mortality. RNN is further utilized to solve the Partially-Observed Markov Decision Process (POMDP) problem due to the lack of fully observed states in real world applications. Experiments on the publicly real-world dataset, i.e., MIMIC-3, illustrate that our model can reduce the estimated mortality, while providing promising accuracy in matching doctors' prescriptions.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Machine Learning

1807.01473

Country:

North America > United States > District of Columbia > Washington (0.05)
Asia > China (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Add feedback

Toward Supervised Reinforcement Learning with Partial States for Social HRI

Senft, Emmanuel (Plymouth University) | Lemaignan, Séverin (Plymouth University) | Baxter, Paul (University of Lincoln) | Belpaeme, Tony (Plymouth University)

AAAI ConferencesOct-31-2017

Social interacting is a complex task for which machine learning holds particular promise. However, as no sufficiently accurate simulator of human interactions exists today, the learning of social interaction strategies has to happen online in the real world. Actions executed by the robot impact on humans, and as such have to be carefully selected, making it impossible to rely on random exploration. Additionally, no clear reward function exists for social interactions. This implies that traditional approaches used for Reinforcement Learning cannot be directly applied for learning how to interact with the social world. As such we argue that robots will profit from human expertise and guidance to learn social interactions. However, as the quantity of input a human can provide is limited, new methods have to be designed to use human input more efficiently. In this paper we describe a setup in which we combine a framework called Supervised Progressively Autonomous Robot Competencies (SPARC), which allows safer online learning with Reinforcement Learning, with the use of partial states rather than full states to accelerate generalisation and obtain a usable action policy more quickly.

artificial intelligence, machine learning, supervised reinforcement learning, (2 more...)

AAAI Conferences

2017 AAAI Fall Symposium Series

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.80)

Add feedback