
Appendix for Model based Policy Optimization with Unsupervised Model Adaptation A Omitted Proofs

Neural Information Processing Systems

Besides the Wasserstein distance, other distribution divergence metrics can be used to align the features. Maximum Mean Discrepancy (MMD) is another instance of an integral probability metric (IPM), obtained when the witness function class is the unit ball of a reproducing kernel Hilbert space (RKHS). The results on three environments are shown in Figure 5. We show the one-step model losses during the experiments in the other four environments in Figure D.5. We find that the conclusion in Section 5.2 still holds in these four environments.
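As a concrete illustration of the MMD mentioned above, the following minimal sketch computes a biased empirical estimate of squared MMD between two sample sets with a Gaussian (RBF) kernel. The kernel choice and bandwidth are illustrative, not taken from the paper.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian (RBF) kernel; its RKHS unit ball is the MMD witness class.
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Biased empirical estimate of squared MMD between samples x and y.
    kxx = rbf_kernel(x, x, sigma)
    kyy = rbf_kernel(y, y, sigma)
    kxy = rbf_kernel(x, y, sigma)
    return kxx.mean() + kyy.mean() - 2 * kxy.mean()

rng = np.random.default_rng(0)
same = mmd2(rng.normal(size=(100, 2)), rng.normal(size=(100, 2)))
diff = mmd2(rng.normal(size=(100, 2)), rng.normal(3.0, 1.0, size=(100, 2)))
# Samples from the same distribution yield a much smaller estimate
# than samples from a mean-shifted distribution.
print(f"same-dist MMD^2: {same:.4f}, shifted MMD^2: {diff:.4f}")
```

Used as a feature-alignment loss, the estimate is minimized so that features from the two domains become indistinguishable under the kernel.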


More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration

Yuan, Xiaoyang, Ding, Yujuan, Bin, Yi, Shao, Wenqi, Cai, Jinyu, Song, Jingkuan, Yang, Yang, Shen, Heng Tao

arXiv.org Artificial Intelligence

Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability of Large Language Models (LLMs). However, prevailing methods primarily rely on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This "guidance-on-demand" approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show that AMPO substantially outperforms a strong baseline (GRPO), with a 4.3% improvement on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method achieves results comparable to approaches that leverage a single, more powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. Our code is available at https://github.com/SII-Enigma/AMPO. Recent advances in long chain-of-thought (LongCoT) reasoning Jaech et al. (2024); Guo et al. (2025); Team et al. (2025) have endowed Large Language Models (LLMs) with remarkable complex reasoning capabilities. A key driver behind this progress is the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm.
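The "guidance-on-demand" idea in the abstract can be sketched in a few lines. All function names below are illustrative stand-ins, not the authors' API; the comprehension score here is a crude length-based proxy for the paper's comprehension-based selection.

```python
def guidance_on_demand(prompt, student, teachers, verifier, comprehend,
                       n_samples=4, k=1):
    """Sketch of AMPO-style guidance-on-demand: teachers are queried only
    when every on-policy sample fails the verifier."""
    samples = [student(prompt) for _ in range(n_samples)]
    if any(verifier(prompt, s) for s in samples):
        return samples  # self-discovery succeeded: train on-policy as usual
    # All student samples failed: collect correct teacher traces, then keep
    # the k traces the student is most likely to comprehend.
    guided = [teacher(prompt) for teacher in teachers]
    correct = [g for g in guided if verifier(prompt, g)]
    best = sorted(correct, key=comprehend, reverse=True)[:k]
    return samples + best

# Toy run: the student is always wrong, so the most comprehensible correct
# teacher trace is mixed into the training batch.
student = lambda p: "answer: 5"
teachers = [lambda p: "first expand, then simplify ... answer: 4",
            lambda p: "answer: 4",
            lambda p: "answer: 7"]
verifier = lambda p, s: s.endswith("answer: 4")
comprehend = lambda trace: -len(trace)  # crude proxy: shorter = easier to follow
batch = guidance_on_demand("2 + 2 = ?", student, teachers, verifier, comprehend)
```

When the student already produces a correct sample, the batch is purely on-policy, which is what preserves the value of self-discovery.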



Adaptive Thinking via Mode Policy Optimization for Social Language Agents

Wang, Minzheng, Li, Yongbin, Wang, Haobo, Zhang, Xinghua, Xu, Nan, Wu, Bingli, Huang, Fei, Yu, Haiyang, Mao, Wenji

arXiv.org Artificial Intelligence

Effective social intelligence simulation requires language agents to dynamically adjust reasoning depth, a capability notably absent in current studies. Existing methods either lack this kind of reasoning capability or enforce Long Chain-of-Thought reasoning uniformly across all scenarios, resulting in excessive token usage and inflexible social simulation. To address this, we propose an Adaptive Mode Learning (AML) framework in this paper, aiming to improve the adaptive thinking ability of language agents in dynamic social interactions. To this end, we first identify hierarchical thinking modes, ranging from intuitive response to deep deliberation, based on cognitive control theory. We then develop the Adaptive Mode Policy Optimization (AMPO) algorithm to optimize context-aware mode switching and reasoning. Our framework advances existing research in three key aspects: (1) multi-granular thinking mode design, (2) context-aware mode switching across social interactions, and (3) token-efficient reasoning via depth-adaptive processing. Extensive experiments on social intelligence benchmarks verify that AML achieves 15.6% higher task performance than GPT-4o. Notably, our AMPO outperforms GRPO by 7.0% with 32.8% shorter reasoning chains, demonstrating the advantage of AMPO's adaptive thinking mode selection and optimization mechanism over GRPO's fixed-depth solution.
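The hierarchical-mode idea can be made concrete with a small sketch. The mode names, token budgets, and cost weight below are assumptions for illustration, not the paper's actual configuration; the reward simply trades correctness against token cost, which is what makes cheap modes attractive on easy conversational turns.

```python
# Hierarchical thinking modes, from intuitive response to deep deliberation.
MODES = {
    "intuitive": {"max_tokens": 64},            # fast, near-reflexive reply
    "shallow_think": {"max_tokens": 256},       # brief reasoning
    "deep_deliberation": {"max_tokens": 1024},  # full LongCoT reasoning
}

def mode_reward(correct, tokens_used, alpha=0.0005):
    # Reward shaping: correctness minus a token-cost penalty.
    return (1.0 if correct else 0.0) - alpha * tokens_used

def pick_mode(success_prob, alpha=0.0005):
    # Greedy context-aware switch: choose the mode maximizing expected reward,
    # given an estimated success probability per mode for this turn.
    return max(MODES, key=lambda m: success_prob[m] - alpha * MODES[m]["max_tokens"])

# On an easy turn, deep reasoning barely helps, so cheap modes win;
# on a hard turn, the accuracy gain outweighs the token cost.
easy_turn = {"intuitive": 0.90, "shallow_think": 0.92, "deep_deliberation": 0.95}
hard_turn = {"intuitive": 0.10, "shallow_think": 0.40, "deep_deliberation": 0.90}
```

A policy-gradient method such as AMPO would learn this switching behavior from the reward rather than compute it greedily, but the trade-off being optimized is the same.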


AMPO: Automatic Multi-Branched Prompt Optimization

Yang, Sheng, Wu, Yurong, Gao, Yan, Zhou, Zineng, Zhu, Bin Benjamin, Sun, Xiaodi, Lou, Jian-Guang, Ding, Zhiming, Hu, Anbang, Fang, Yuan, Li, Yunsong, Chen, Junyan, Yang, Linjun

arXiv.org Artificial Intelligence

Prompt engineering is crucial for enhancing the performance of large language models (LLMs). When dealing with complex issues, prompt engineers tend to distill multiple patterns from examples and inject relevant solutions to optimize the prompts, achieving satisfying results. However, existing automatic prompt optimization techniques are limited to producing single-flow instructions and struggle to handle diverse patterns. In this paper, we present AMPO, an automatic prompt optimization method that can iteratively develop a multi-branched prompt using failure cases as feedback. Our goal is to explore a novel way of structuring prompts with multiple branches to better handle the multiple patterns present in complex tasks, for which we introduce three modules: Pattern Recognition, Branch Adjustment, and Branch Pruning. In experiments across five tasks, AMPO consistently achieves the best results. Additionally, our approach demonstrates significant optimization efficiency due to our adoption of a minimal search strategy.
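The three-module loop described in the abstract can be sketched as one optimization round. The stage names mirror the paper's modules, but the toy implementations below (a dict of pattern-to-instruction branches, set-based pattern recognition, pruning of empty branches) are illustrative placeholders, not the paper's method.

```python
def ampo_round(branches, failures, recognize, adjust, prune):
    """One round of multi-branched prompt optimization, driven by failure cases."""
    patterns = recognize(failures)         # Pattern Recognition
    branches = adjust(branches, patterns)  # Branch Adjustment: add/edit branches
    return prune(branches)                 # Branch Pruning: drop redundant branches

# Toy instantiation: a prompt is a dict of pattern -> branch instruction.
recognize = lambda fails: {f["pattern"] for f in fails}
adjust = lambda br, pats: {**br, **{p: f"If the input involves {p}, reason step by step."
                                    for p in pats if p not in br}}
prune = lambda br: {p: text for p, text in br.items() if text}  # drop empty branches

prompt = {"arithmetic": "Compute the result directly.", "stale": ""}
failures = [{"pattern": "date math"}, {"pattern": "unit conversion"}]
prompt = ampo_round(prompt, failures, recognize, adjust, prune)
# The prompt now carries one branch per recognized pattern, with the
# empty "stale" branch pruned away.
```

Iterating such rounds on fresh failure cases is what grows the single-flow instruction into a multi-branched prompt.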


Meta-learning the mirror map in policy mirror descent

Alfano, Carlo, Towers, Sebastian, Sapora, Silvia, Lu, Chris, Rebeschini, Patrick

arXiv.org Artificial Intelligence

Policy Mirror Descent (PMD) is a popular framework in reinforcement learning, serving as a unifying perspective that encompasses numerous algorithms. These algorithms are derived through the selection of a mirror map and enjoy finite-time convergence guarantees. Despite its popularity, the exploration of PMD's full potential is limited, with the majority of research focusing on a particular mirror map -- namely, the negative entropy -- which gives rise to the renowned Natural Policy Gradient (NPG) method. It remains uncertain from existing theoretical studies whether the choice of mirror map significantly influences PMD's efficacy. In our work, we conduct empirical investigations to show that the conventional mirror map choice (NPG) often yields less-than-optimal outcomes across several standard benchmark environments. By applying a meta-learning approach, we identify more efficient mirror maps that enhance performance, both on average and in terms of best performance achieved along the training trajectory. We analyze the characteristics of these learned mirror maps and reveal shared traits among certain settings. Our results suggest that mirror maps have the potential to be adaptable across various environments, raising questions about how to best match a mirror map to an environment's structure and characteristics.
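The negative-entropy mirror map the abstract singles out yields a closed-form PMD step: a multiplicative-weights (NPG-style) update of the tabular policy. The sketch below shows one such step; the step size and toy Q-values are illustrative.

```python
import numpy as np

def pmd_step_neg_entropy(pi, q, eta=0.5):
    # PMD step with the negative-entropy mirror map, which reduces to
    #   pi_{t+1}(a|s)  proportional to  pi_t(a|s) * exp(eta * Q_t(s, a)).
    logits = np.log(pi) + eta * q
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=-1, keepdims=True)

pi = np.full((2, 3), 1.0 / 3.0)  # uniform policy: 2 states, 3 actions
q = np.array([[1.0, 0.0, 0.0],   # toy Q-values, purely illustrative
              [0.0, 0.0, 2.0]])
pi = pmd_step_neg_entropy(pi, q)
# Each row remains a probability distribution, with mass shifted toward
# the higher-Q action in that state.
```

Meta-learning the mirror map, as in the paper, amounts to replacing this fixed exponential update with a learned alternative and optimizing it for training performance.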