AITopics | hybrid policy optimization

Hybrid Policy Optimization from Imperfect Demonstrations

Neural Information Processing Systems | Dec-23-2025, 21:59:05 GMT

Exploration is one of the main challenges in Reinforcement Learning (RL), especially in environments with sparse rewards. Learning from Demonstrations (LfD) is a promising approach to solving this problem by leveraging expert demonstrations. However, expert demonstrations of high quality are usually costly or even impossible to collect in real-world applications. In this work, we propose a novel RL algorithm called HYbrid Policy Optimization (HYPO), which uses a small number of imperfect demonstrations to accelerate an agent's online learning process. The key idea is to train an offline guider policy using imitation learning in order to instruct an online agent policy to explore efficiently. Through mutual update of the guider policy and the agent policy, the agent can leverage suboptimal demonstrations for efficient exploration while avoiding the conservative policy caused by imperfect demonstrations. Empirical results show that HYPO significantly outperforms several baselines in various challenging tasks, such as MuJoCo with sparse rewards, Google Research Football, and the AirSim drone simulation.
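The guider/agent idea in the abstract above can be illustrated with a toy tabular sketch. This is not the HYPO algorithm from the paper: it assumes discrete states and actions, uses smoothed counting as a stand-in for imitation learning, and uses a KL penalty as one plausible way for a guider to steer an agent. The names `clone_guider`, `kl_divergence`, `guided_objective`, and the weight `beta` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def clone_guider(demos, n_actions, smoothing=1.0):
    """Behavior cloning by smoothed counting (assumed stand-in for imitation learning).

    demos: iterable of (state, action) pairs from imperfect demonstrations.
    Returns a dict mapping each seen state to a list of action probabilities.
    """
    counts = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1
    guider = {}
    for state, c in counts.items():
        total = sum(c.values()) + smoothing * n_actions
        guider[state] = [(c[a] + smoothing) / total for a in range(n_actions)]
    return guider


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def guided_objective(rl_objective, agent_probs, guider_probs, beta):
    """Hypothetical agent objective: RL term minus a penalty for straying from the guider.

    Lowering beta over training would let the agent move past suboptimal demonstrations.
    """
    return rl_objective - beta * kl_divergence(agent_probs, guider_probs)


# Three imperfect demonstrations in state 0: action 1 twice, action 0 once.
guider = clone_guider([(0, 1), (0, 1), (0, 0)], n_actions=2)
score = guided_objective(1.0, [0.5, 0.5], guider[0], beta=0.5)
```

In this sketch the smoothing term keeps every action's probability above zero, so the KL penalty stays finite even for actions the demonstrations never show.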

demonstration, hybrid policy optimization, name change, (8 more...)


Genre: Research Report (0.61)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.61)


HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs

Deng, Ken, Zhan, Zizheng, Xiang, Wen, Zhu, Wenqiang, Li, Weihao, Xu, Jingxuan, Peng, Tianhao, Lei, Xinping, Wu, Kun, Yao, Yifan, Huang, Haoyang, Tang, Huaixi, Lei, Kepeng, Lai, Zhiyi, Yu, Songwei, Feng, Zongxian, Gao, Zuchen, Xie, Weihao, Zhang, Chenchen, Wu, Yanan, Zhang, Yuanxing, Huang, Lecheng, Zhang, Yuqun, Liu, Jie, Zhang, Zhaoxiang, Zhang, Haotian, Chen, Bin, Liu, Jiaheng

arXiv.org Artificial Intelligence | Oct-22-2025

Large Language Models (LLMs) increasingly rely on Chain-of-Thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control that enables LLMs to selectively decide when to engage in detailed reasoning (Think-on) and when to respond directly (Think-off). Specifically, HiPO combines a hybrid data pipeline, which provides paired Think-on and Think-off responses, with a hybrid reinforcement learning reward system that balances accuracy and efficiency while avoiding over-reliance on detailed reasoning. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy. Finally, we hope HiPO can be a principled approach for efficient adaptive reasoning, advancing the deployment of reasoning-oriented LLMs in real-world, resource-sensitive settings.
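As a rough illustration of a reward that trades accuracy against token usage, the sketch below scores a response by correctness minus a capped length penalty. The abstract does not give HiPO's actual reward, so the function `hybrid_reward`, the parameters `token_budget` and `length_weight`, and the linear penalty are all assumptions made for illustration.

```python
def hybrid_reward(correct, n_tokens, token_budget=1024, length_weight=0.2):
    """Hypothetical reward: 1 for a correct answer, minus a capped length penalty.

    correct: whether the final answer was judged correct.
    n_tokens: number of tokens generated (reasoning plus answer).
    token_budget: assumed length at which the penalty saturates.
    length_weight: assumed trade-off between accuracy and efficiency.
    """
    accuracy_term = 1.0 if correct else 0.0
    length_fraction = min(1.0, n_tokens / token_budget)
    return accuracy_term - length_weight * length_fraction


# A short direct answer (Think-off) and a long reasoned answer (Think-on),
# both correct: the shorter one earns the higher reward under this sketch.
think_off = hybrid_reward(correct=True, n_tokens=64)
think_on = hybrid_reward(correct=True, n_tokens=2048)
```

Because `length_weight` is below 1 in this sketch, a correct long answer still scores higher than an incorrect short one, so the penalty discourages unnecessary reasoning without rewarding wrong answers.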

artificial intelligence, large language model, natural language, (17 more...)


2509.23967

Genre: Research Report (0.54)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)


Hybrid Policy Optimization from Imperfect Demonstrations

Neural Information Processing Systems | Oct-9-2024, 17:43:55 GMT

demonstration, hybrid policy optimization, policy optimization, (6 more...)


Genre: Research Report (0.44)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.64)
