AITopics | suboptimal policy

Collaborating Authors

suboptimal policy

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Super Hard

Neural Information Processing SystemsFeb-15-2026, 04:46:57 GMT

We thank all the reviewers for their feedback. All reviewers are concerned whether we substantially outperform QMIX. Since StarCraft II experiments take a long time, we could not include all the results in the submission. Samvelyan et al. have classified as Easy, Hard & Super Hard. Results on several maps are shown below.

artificial intelligence, exploration, representational constraint, (14 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.30)

Add feedback

Dr Jekyll & Mr Hyde: the strange case of off-policy policy updates

Neural Information Processing SystemsDec-24-2025, 22:27:05 GMT

The policy gradient theorem states that the policy should only be updated in states that are visited by the current policy, which leads to insufficient planning in the off-policy states, and thus to convergence to suboptimal policies. We tackle this planning issue by extending the policy gradient theory to policy updates with respect to any state density. Under these generalized policy updates, we show convergence to optimality under a necessary and sufficient condition on the updates' state densities, and thereby solve the aforementioned planning issue. We also prove asymptotic convergence rates that significantly improve those in the policy gradient literature. To implement the principles prescribed by our theory, we propose an agent, Dr Jekyll & Mr Hyde (J&H), with a double personality: Dr Jekyll purely exploits while Mr Hyde purely explores. J&H's independent policies allow to record two separate replay buffers: one on-policy (Dr Jekyll's) and one off-policy (Mr Hyde's), and therefore to update J&H's models with a mixture of on-policy and off-policy updates. More than an algorithm, J&H defines principles for actor-critic algorithms to satisfy the requirements we identify in our analysis. We extensively test on finite MDPs where J&H demonstrates a superior ability to recover from converging to a suboptimal policy without impairing its speed of convergence. We also implement a deep version of the algorithm and test it on a simple problem where it shows promising results.

dr jekyll, policy update, strange case, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.39)

Add feedback

Super Hard

Neural Information Processing SystemsAug-20-2025, 09:46:32 GMT

exploration, qtran, representational constraint, (13 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.30)

Add feedback

Greedy Algorithm for Structured Bandits: A Sharp Characterization of Asymptotic Success / Failure

Slivkins, Aleksandrs, Xu, Yunzong, Zuo, Shiliang

arXiv.org Artificial IntelligenceMar-5-2025

We study the greedy (exploitation-only) algorithm in bandit problems with a known reward structure. We allow arbitrary finite reward structures, while prior work focused on a few specific ones. We fully characterize when the greedy algorithm asymptotically succeeds or fails, in the sense of sublinear vs. linear regret as a function of time. Our characterization identifies a partial identifiability property of the problem instance as the necessary and sufficient condition for the asymptotic success. Notably, once this property holds, the problem becomes easy--any algorithm will succeed (in the same sense as above), provided it satisfies a mild non-degeneracy condition. We further extend our characterization to contextual bandits and interactive decision-making with arbitrary feedback, and demonstrate its broad applicability across various examples. Keywords: Multi-armed bandits, contextual bandits, structured bandits, greedy algorithm, regret.

algorithm, bandit, probability, (12 more...)

arXiv.org Artificial Intelligence

2503.0401

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Illinois > Champaign County > Urbana (0.04)
North America > United States > New York (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.87)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.81)

Add feedback

CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning

Zhao, Yanxiao, Qian, Yangge, Shan, Jingyang, Qin, Xiaolin

arXiv.org Artificial IntelligenceFeb-17-2025

Reinforcement learning (RL) in continuous action spaces encounters persistent challenges, such as inefficient exploration and convergence to suboptimal solutions. To address these limitations, we propose CAMEL, a novel framework integrating LLM-generated suboptimal policies into the RL training pipeline. CAMEL leverages dynamic action masking and an adaptive epsilon-masking mechanism to guide exploration during early training stages while gradually enabling agents to optimize policies independently. At the core of CAMEL lies the integration of Python-executable suboptimal policies generated by LLMs based on environment descriptions and task objectives. Although simplistic and hard-coded, these policies offer valuable initial guidance for RL agents. To effectively utilize these priors, CAMEL employs masking-aware optimization to dynamically constrain the action space based on LLM outputs. Additionally, epsilon-masking gradually reduces reliance on LLM-generated guidance, enabling agents to transition from constrained exploration to autonomous policy refinement. Experimental validation on Gymnasium MuJoCo environments demonstrates the effectiveness of CAMEL. In Hopper-v4 and Ant-v4, LLM-generated policies significantly improve sample efficiency, achieving performance comparable to or surpassing expert masking baselines. For Walker2d-v4, where LLMs struggle to accurately model bipedal gait dynamics, CAMEL maintains robust RL performance without notable degradation, highlighting the framework's adaptability across diverse tasks. While CAMEL shows promise in enhancing sample efficiency and mitigating convergence challenges, these issues remain open for further research. Future work aims to generalize CAMEL to multimodal LLMs for broader observation-action spaces and automate policy evaluation, reducing human intervention and enhancing scalability in RL training pipelines.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2502.11896

Country: Asia (0.14)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

SplAgger: Split Aggregation for Meta-Reinforcement Learning

Beck, Jacob, Jackson, Matthew, Vuorio, Risto, Xiong, Zheng, Whiteson, Shimon

arXiv.org Artificial IntelligenceJun-1-2024

A core ambition of reinforcement learning (RL) is the creation of agents capable of rapid learning in novel tasks. Meta-RL aims to achieve this by directly learning such agents. Black box methods do so by training off-the-shelf sequence models end-to-end. By contrast, task inference methods explicitly infer a posterior distribution over the unknown task, typically using distinct objectives and sequence models designed to enable task inference. Recent work has shown that task inference methods are not necessary for strong performance. However, it remains unclear whether task inference sequence models are beneficial even when task inference objectives are not. In this paper, we present evidence that task inference sequence models are indeed still beneficial. In particular, we investigate sequence models with permutation invariant aggregation, which exploit the fact that, due to the Markov property, the task posterior does not depend on the order of data. We empirically confirm the advantage of permutation invariant sequence models without the use of task inference objectives. However, we also find, surprisingly, that there are multiple conditions under which permutation variance remains useful. Therefore, we propose SplAgger, which uses both permutation variant and invariant components to achieve the best of both worlds, outperforming all baselines evaluated on continuous control and memory environments. Code is provided at https://github.com/jacooba/hyper.

aggregation, sequence model, splagger, (15 more...)

arXiv.org Artificial Intelligence

2403.0302

Country: Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

A CMDP-within-online framework for Meta-Safe Reinforcement Learning

Khattar, Vanshaj, Ding, Yuhao, Sel, Bilgehan, Lavaei, Javad, Jin, Ming

arXiv.org Artificial IntelligenceMay-26-2024

Meta-reinforcement learning has widely been used as a learning-to-learn framework to solve unseen tasks with limited experience. However, the aspect of constraint violations has not been adequately addressed in the existing works, making their application restricted in real-world settings. In this paper, we study the problem of meta-safe reinforcement learning (Meta-SRL) through the CMDP-within-online framework to establish the first provable guarantees in this important setting. We obtain task-averaged regret bounds for the reward maximization (optimality gap) and constraint violations using gradient-based meta-learning and show that the task-averaged optimality gap and constraint satisfaction improve with task-similarity in a static environment or task-relatedness in a dynamic environment. Several technical challenges arise when making this framework practical. To this end, we propose a meta-algorithm that performs inexact online learning on the upper bounds of within-task optimality gap and constraint violations estimated by off-policy stationary distribution corrections. Furthermore, we enable the learning rates to be adapted for every task and extend our approach to settings with a competing dynamically changing oracle. Finally, experiments are conducted to demonstrate the effectiveness of our approach.

algorithm, conference paper, constraint violation, (13 more...)

arXiv.org Artificial Intelligence

2405.16601

Country:

North America > United States > Virginia > Montgomery County > Blacksburg (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > Aruba (0.04)

Genre: Research Report (0.81)

Industry: Education > Educational Setting (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Emergence of In-Context Reinforcement Learning from Noise Distillation

Zisman, Ilya, Kurenkov, Vladislav, Nikulin, Alexander, Sinii, Viacheslav, Kolesnikov, Sergey

arXiv.org Artificial IntelligenceFeb-7-2024

Recently, extensive studies in Reinforcement Learning have been carried out on the ability of transformers to adapt in-context to various environments and tasks. Current in-context RL methods are limited by their strict requirements for data, which needs to be generated by RL agents or labeled with actions from an optimal policy. In order to address this prevalent problem, we propose AD$^\varepsilon$, a new data acquisition approach that enables in-context Reinforcement Learning from noise-induced curriculum. We show that it is viable to construct a synthetic noise injection curriculum which helps to obtain learning histories. Moreover, we experimentally demonstrate that it is possible to alleviate the need for generation using optimal policies, with in-context RL still able to outperform the best suboptimal policy in a learning dataset by a 2x margin.

agent, in-context reinforcement learning, learning, (12 more...)

arXiv.org Artificial Intelligence

2312.12275

Country:

Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Europe > Austria > Vienna (0.04)
Asia > Russia (0.04)

Genre:

Instructional Material > Course Syllabus & Notes (0.46)
Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Beyond variance reduction: Understanding the true impact of baselines on policy optimization

Chung, Wesley, Thomas, Valentin, Machado, Marlos C., Roux, Nicolas Le

arXiv.org Machine LearningAug-31-2020

Policy gradients methods are a popular and effective choice to train reinforcement learning agents in complex environments. The variance of the stochastic policy gradient is often seen as a key quantity to determine the effectiveness of the algorithm. Baselines are a common addition to reduce the variance of the gradient, but previous works have hardly ever considered other effects baselines may have on the optimization process. Using simple examples, we find that baselines modify the optimization dynamics even when the variance is the same. In certain cases, a baseline with lower variance may even be worse than another with higher variance. Furthermore, we find that the choice of baseline can affect the convergence of natural policy gradient, where certain baselines may lead to convergence to a suboptimal policy for any stepsize. Such behaviour emerges when sampling is constrained to be done using the current policy and we show how decoupling the sampling policy from the current policy guarantees convergence for a much wider range of baselines. More broadly, this work suggests that a more careful treatment of stochasticity in the updates---beyond the immediate variance---is necessary to understand the optimization process of policy gradient algorithms.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

2008.13773

Country: North America > Canada > Quebec > Montreal (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Add feedback

Bootstrapping Simulation-Based Algorithms with a Suboptimal Policy

Nguyen, Truong-Huy Dinh (Northeastern University) | Silander, Tomi (Xerox Research Centre Europe) | Lee, Wee-Sun (National University of Singapore) | Leong, Tze-Yun (National University of Singapore)

AAAI ConferencesJun-9-2014

Finding optimal policies for Markov Decision Processes with large state spaces is in general intractable. Nonetheless, simulation-based algorithms inspired by Sparse Sampling (SS) such as Upper Confidence Bound applied in Trees (UCT) and Forward Search Sparse Sampling (FSSS) have been shown to perform reasonably well in both theory and practice, despite the high computational demand. To improve the efficiency of these algorithms, we adopt a simple enhancement technique with a heuristic policy to speed up the selection of optimal actions. The general method, called Aux, augments the look-ahead tree with auxiliary arms that are evaluated by the heuristic policy. In this paper, we provide theoretical justification for the method and demonstrate its effectiveness in two experimental benchmarks that showcase the faster convergence to a near optimal policy for both SS and FSSS. Moreover, to further speed up the convergence of these algorithms at the early stage, we present a novel mechanism to combine them with UCT so that the resulting hybrid algorithm is superior to both of its components.

bootstrapping simulation-based algorithm, suboptimal policy

AAAI Conferences

Twenty-Fourth International Conference on Automated Planning and Scheduling

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.40)

Add feedback