Maestro: Learning to Collaborate via Conditional Listwise Policy Optimization for Multi-Agent LLMs

Yang, Wei; Pang, Jiacheng; Li, Shixuan; Bogdan, Paul; Tu, Stephen; Thomason, Jesse

Nov-11-2025–arXiv.org Artificial Intelligence

Multi-agent systems (MAS) built on Large Language Models (LLMs) are being used to approach complex problems and can surpass single-model inference. However, their success hinges on navigating a fundamental cognitive tension: the need to balance broad, divergent exploration of the solution space with a principled, convergent synthesis toward the optimal solution. Existing paradigms often struggle to manage this duality, leading to premature consensus, error propagation, and a critical credit assignment problem in which genuine reasoning is not distinguished from superficially plausible arguments. To operationalize this critical synthesis phase, we introduce Conditional Listwise Policy Optimization (CLPO), a reinforcement learning objective that disentangles signals for strategic decisions and tactical rationales. By combining decision-focused policy gradients with a listwise ranking loss over justifications, CLPO achieves clean credit assignment and stronger comparative supervision.

The rise of large language models (LLMs) has enabled a new type of multi-agent system (MAS) (Park et al., 2023; Chen et al., 2023a; Zhu et al., 2025), in which multiple model instances collaborate to tackle problems that exceed the capacity of any single model (Zhang et al., 2024a; Qiao et al., 2024; Han et al., 2025). By distributing roles and enabling structured interaction, MASs hold the promise of achieving robustness, creativity, and reliability that emerge from collective intelligence (Cheng et al., 2024; Pezeshkpour et al., 2024). At the heart of any effective collaborative system lies a fundamental cognitive tension. Early work in the psychology of creativity (Runco & Chand, 1995; Brophy, 2001; Zhang et al., 2020) emphasizes that intelligent problem-solving requires a dynamic balance between two seemingly contradictory modes of thought: Divergent Creativity and Convergent Critique. Guilford's theory of divergent and convergent thinking (Guilford, 1967) formalizes this duality: divergence is the generative process of exploring a wide array of alternative hypotheses, while convergence is the evaluative process of comparing, refining, and synthesizing these options.



Genre:
- Research Report (1.00)

Industry:
- Education (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Representation & Reasoning > Agents
    - Agent Societies (0.46)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)
