Goto

Collaborating Authors

 rema


ReMA: Learning to Meta-think for LLMs with Multi-agent Reinforcement Learning

Neural Information Processing Systems

Recent research on Reasoning of Large Language Models (LLMs) has sought to further enhance their performance by integrating meta-thinking--enabling models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem-solving. However, current single-agent work lacks a specialized design for acquiring meta-thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit metathinking behaviors, encouraging LLMs to think about thinking.


ReMA: Learning to Meta-Think for LLMs with Multi-agent Reinforcement Learning

Neural Information Processing Systems

Recent research on Reasoning of Large Language Models (LLMs) has sought to further enhance their performance by integrating meta-thinking--enabling models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem-solving. However, current single-agent work lacks a specialized design for acquiring meta-thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors, encouraging LLMs to think about thinking.



Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation

arXiv.org Artificial Intelligence

Large Language Models (LLMs) trained with reinforcement learning and verifiable rewards have achieved strong results on complex reasoning tasks. Recent work extends this paradigm to a multi-agent setting, where a meta-thinking agent proposes plans and monitors progress while a reasoning agent executes subtasks through sequential conversational turns. Despite promising performance, we identify a critical limitation: lazy agent behavior, in which one agent dominates while the other contributes little, undermining collaboration and collapsing the setup to an ineffective single agent. In this paper, we first provide a theoretical analysis showing why lazy behavior naturally arises in multi-agent reasoning. We then introduce a stable and efficient method for measuring causal influence, helping mitigate this issue. Finally, as collaboration intensifies, the reasoning agent risks getting lost in multi-turn interactions and trapped by previous noisy responses. To counter this, we propose a verifiable reward mechanism that encourages deliberation by allowing the reasoning agent to discard noisy outputs, consolidate instructions, and restart its reasoning process when necessary. Extensive experiments demonstrate that our framework alleviates lazy agent behavior and unlocks the full potential of multi-agent framework for complex reasoning tasks. Techniques such as chain-of-thought prompting (Wei et al., 2022; Kojima et al., 2022) and structured methods like Tree-of-Thoughts and Graph-of-Thoughts (Y ao et al., 2023; Besta et al., 2024) expand the space for deliberation. More recently, multi-agent frameworks enable LLMs with specialized roles to collaborate via planning, delegation, and debate, echoing human team dynamics (Li et al., 2023; Wu et al., 2024a; Chen et al., 2023; Du et al., 2023; Y uan & Xie). To support multi-agent and multi-turn reinforcement learning, multi-turn Group Relative Preference Optimization (GRPO) (Wan et al., 2025; Shi et al., 2025; Wei et al., 2025) and its variants (Guo et al., 2025b; Zhang et al., 2025c; Ning et al., 2025; Xue et al., 2025) compute advantages and importance ratios at the turn level, enabling finer-grained optimization and more precise credit assignment. Building on this foundation, ReMA (Wan et al., 2025) introduces a multi-agent LLM reasoning framework with two specialized roles: a meta-thinking agent, which decomposes tasks, sets intermediate goals, and adapts based on feedback, and a reasoning agent, which performs step-by-step 1 The agents alternate sequentially, but since only a final outcome reward is available, ReMA computes a group advantage following GRPO (Shao et al., 2024) and uniformly assigns this trajectory-level signal to every turn in the rollout.



PREMA: Principled Tensor Data Recovery from Multiple Aggregated Views

arXiv.org Machine Learning

Multidimensional data have become ubiquitous and are frequently involved in situations where the information is aggregated over multiple data atoms. The aggregation can be over time or other features, such as geographical location or group affiliation. We often have access to multiple aggregated views of the same data, each aggregated in one or more dimensions, especially when data are collected or measured by different agencies. However, data mining and machine learning models require detailed data for personalized analysis and prediction. Thus, data disaggregation algorithms are becoming increasingly important in various domains. The goal of this paper is to reconstruct finer-scale data from multiple coarse views, aggregated over different (subsets of) dimensions. The proposed method, called PREMA, leverages low-rank tensor factorization tools to provide recovery guarantees under certain conditions. PREMA is flexible in the sense that it can perform disaggregation on data that have missing entries, i.e., partially observed. The proposed method considers challenging scenarios: i) the available views of the data are aggregated in two dimensions, i.e., double aggregation, and ii) the aggregation patterns are unknown. Experiments on real data from different domains, i.e., sales data from retail companies, crime counts, and weather observations, are presented to showcase the effectiveness of PREMA.