AITopics | mapo

Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing

Neural Information Processing SystemsMar-17-2026, 02:07:23 GMT

We present Memory Augmented Policy Optimization (MAPO), a simple and novel way to leverage a memory buffer of promising trajectories to reduce the variance of policy gradient estimate. MAPO is applicable to deterministic environments with discrete actions, such as structured prediction and combinatorial optimization tasks. We express the expected return objective as a weighted sum of two terms: an expectation over the high-reward trajectories inside the memory buffer, and a separate expectation over trajectories outside the buffer. To make an efficient algorithm of MAPO, we propose: (1) memory weight clipping to accelerate and stabilize training; (2) systematic exploration to discover high-reward trajectories; (3) distributed sampling from inside and outside of the memory buffer to scale up training. MAPO improves the sample efficiency and robustness of policy gradient, especially on tasks with sparse rewards. We evaluate MAPO on weakly supervised program synthesis from natural language (semantic parsing). On the WikiTableQuestions benchmark, we improve the state-of-the-art by 2.6%, achieving an accuracy of 46.3%. On the WikiSQL benchmark, MAPO achieves an accuracy of 74.9% with only weak supervision, outperforming several strong baselines with full supervision. Our source code is available at https://goo.gl/TXBp4e

artificial intelligence, machine learning, natural language, (11 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.66)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.63)
Information Technology > Artificial Intelligence > Machine Learning (0.63)

Add feedback

Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing

Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc V. Le, Ni Lao

Neural Information Processing SystemsFeb-15-2026, 06:42:36 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, natural language, (15 more...)

Neural Information Processing Systems

Country:

Asia > Thailand (0.05)
North America > Canada > Quebec > Montreal (0.04)
Europe > Finland (0.04)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.70)
Information Technology > Artificial Intelligence > Natural Language (0.47)

Add feedback

Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing

Neural Information Processing SystemsNov-20-2025, 23:13:51 GMT

We present Memory Augmented Policy Optimization (MAPO), a simple and novel way to leverage a memory buffer of promising trajectories to reduce the variance of policy gradient estimate. MAPO is applicable to deterministic environments with discrete actions, such as structured prediction and combinatorial optimization tasks. We express the expected return objective as a weighted sum of two terms: an expectation over the high-reward trajectories inside the memory buffer, and a separate expectation over trajectories outside the buffer. To make an efficient algorithm of MAPO, we propose: (1) memory weight clipping to accelerate and stabilize training; (2) systematic exploration to discover high-reward trajectories; (3) distributed sampling from inside and outside of the memory buffer to scale up training. MAPO improves the sample efficiency and robustness of policy gradient, especially on tasks with sparse rewards. We evaluate MAPO on weakly supervised program synthesis from natural language (semantic parsing). On the WikiTableQuestions benchmark, we improve the state-of-the-art by 2.6%, achieving an accuracy of 46.3%. On the WikiSQL benchmark, MAPO achieves an accuracy of 74.9% with only weak supervision, outperforming several strong baselines with full supervision. Our source code is available at https://goo.gl/TXBp4e

artificial intelligence, memory augmented policy optimization, natural language, (10 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Add feedback

Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing

Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc V. Le, Ni Lao

Neural Information Processing SystemsNov-20-2025, 21:14:06 GMT

MAPO improves the sample efficiency and robustness of policy gradient, especially on tasks with sparse rewards.

mapo, memory buffer, trajectory, (14 more...)

Neural Information Processing Systems

Country:

Asia > Thailand (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Hungary (0.04)
(5 more...)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

MAPO: Mixed Advantage Policy Optimization

Huang, Wenke, Zhang, Quan, Fang, Yiyang, Liang, Jian, Rong, Xuankun, Yao, Huanjin, Wan, Guancheng, Liang, Ke, He, Wenwen, Li, Mingjun, Rutkowski, Leszek, Ye, Mang, Du, Bo, Tao, Dacheng

arXiv.org Artificial IntelligenceSep-26-2025

Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking the trajectory importance. However, existing explorations encounter both advantage reversion and advantage mirror problems, which hinder the reasonable advantage allocation across different query samples. In this work, we propose an easy but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that the trajectory appears with different certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparison with related state-of-the-art methods, along with ablation studies on different advantage variants, validates the effectiveness of our approach.

arxiv preprint arxiv, large language model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2509.18849

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Introducing MAPO: Momentum-Aided Gradient Descent Prompt Optimization

Cui, Anthony, Nandyalam, Pranav, Cheung, Ethan, Zhu, Kevin

arXiv.org Artificial IntelligenceNov-1-2024

Momentum-Aided Prompt Optimization (MAPO) enhances the efficiency and efficacy of prompt optimization for Large Language Models (LLMs). Building on ProTeGi, MAPO uses positive natural language "gradients" and a momentum-based extension to refine prompts effectively. By tracking gradient history, MAPO avoids local minima and oscillations. It also utilizes beam search and an Upper Confidence Bound (UCB) algorithm for balanced candidate expansion and selection. Benchmark testing shows that MAPO achieves faster convergence time with fewer API calls and higher F1 scores than ProTeGi, proving it as a robust and scalable solution for automated prompt engineering in LLMs.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2410.19499

Genre: Research Report (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.43)

Add feedback

Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing

Neural Information Processing SystemsOct-8-2024, 20:08:53 GMT

We present Memory Augmented Policy Optimization (MAPO), a simple and novel way to leverage a memory buffer of promising trajectories to reduce the variance of policy gradient estimate. MAPO is applicable to deterministic environments with discrete actions, such as structured prediction and combinatorial optimization tasks. We express the expected return objective as a weighted sum of two terms: an expectation over the high-reward trajectories inside the memory buffer, and a separate expectation over trajectories outside the buffer. To make an efficient algorithm of MAPO, we propose: (1) memory weight clipping to accelerate and stabilize training; (2) systematic exploration to discover high-reward trajectories; (3) distributed sampling from inside and outside of the memory buffer to scale up training. MAPO improves the sample efficiency and robustness of policy gradient, especially on tasks with sparse rewards.

memory augmented policy optimization, program synthesis and semantic parsing, trajectory, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.78)

Add feedback

MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization

Chen, Yuyan, Wen, Zhihao, Fan, Ge, Chen, Zhengyu, Wu, Wei, Liu, Dayiheng, Li, Zhixu, Liu, Bang, Xiao, Yanghua

arXiv.org Artificial IntelligenceJul-4-2024

Prompt engineering, as an efficient and effective way to leverage Large Language Models (LLM), has drawn a lot of attention from the research community. The existing research primarily emphasizes the importance of adapting prompts to specific tasks, rather than specific LLMs. However, a good prompt is not solely defined by its wording, but also binds to the nature of the LLM in question. In this work, we first quantitatively demonstrate that different prompts should be adapted to different LLMs to enhance their capabilities across various downstream tasks in NLP. Then we novelly propose a model-adaptive prompt optimizer (MAPO) method that optimizes the original prompts for each specific LLM in downstream tasks. Extensive experiments indicate that the proposed method can effectively refine prompts for an LLM, leading to significant improvements over various downstream tasks.

dataset, llama, llm, (15 more...)

arXiv.org Artificial Intelligence

2407.04118

Country:

North America > United States (0.04)
Asia > Taiwan (0.04)
Asia > China > Shanghai > Shanghai (0.04)
(4 more...)

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.67)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Add feedback

Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing

Liang, Chen, Norouzi, Mohammad, Berant, Jonathan, Le, Quoc V., Lao, Ni

Neural Information Processing SystemsFeb-14-2020, 20:57:12 GMT

We present Memory Augmented Policy Optimization (MAPO), a simple and novel way to leverage a memory buffer of promising trajectories to reduce the variance of policy gradient estimate. MAPO is applicable to deterministic environments with discrete actions, such as structured prediction and combinatorial optimization tasks. We express the expected return objective as a weighted sum of two terms: an expectation over the high-reward trajectories inside the memory buffer, and a separate expectation over trajectories outside the buffer. To make an efficient algorithm of MAPO, we propose: (1) memory weight clipping to accelerate and stabilize training; (2) systematic exploration to discover high-reward trajectories; (3) distributed sampling from inside and outside of the memory buffer to scale up training. MAPO improves the sample efficiency and robustness of policy gradient, especially on tasks with sparse rewards.

memory augmented policy optimization, program synthesis and semantic parsing, trajectory, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.79)
Information Technology > Artificial Intelligence > Machine Learning (0.72)

Add feedback

Learning to Generalize from Sparse and Underspecified Rewards

Agarwal, Rishabh, Liang, Chen, Schuurmans, Dale, Norouzi, Mohammad

arXiv.org Machine LearningFeb-19-2019

We consider the problem of learning from sparse and underspecified rewards, where an agent receives a complex input, such as a natural language instruction, and needs to generate a complex response, such as an action sequence, while only receiving binary success-failure feedback. Such success-failure rewards are often underspecified: they do not distinguish between purposeful and accidental success. Generalization from underspecified rewards hinges on discounting spurious trajectories that attain accidental success, while learning from sparse feedback requires effective exploration. We address exploration by using a mode covering direction of KL divergence to collect a diverse set of successful trajectories, followed by a mode seeking KL divergence to train a robust policy. We propose Meta Reward Learning (MeRL) to construct an auxiliary reward function that provides more refined feedback for learning. The parameters of the auxiliary reward function are optimized with respect to the validation performance of a trained policy. The MeRL approach outperforms our alternative reward learning technique based on Bayesian Optimization, and achieves the state-of-the-art on weakly-supervised semantic parsing. It improves previous work by 1.2% and 2.4% on WikiTableQuestions and WikiSQL datasets respectively.

learning, reward function, trajectory, (16 more...)

arXiv.org Machine Learning

1902.07198

Country: