CTTS: Collective Test-Time Scaling

Song, Zhende, Tang, Shengji, Ye, Peng, Fan, Jiayuan, Bai, Lei, Chen, Tao, Ouyang, Wanli

arXiv.org Artificial Intelligence 

Test-time scaling (TTS) has emerged as a promising, training-free approach for enhancing large language model (LLM) performance. However, the efficacy of existing methods, such as Best-of-N and Self-Consistency, is fundamentally constrained by the dominant single test-time scaling (STTS) paradigm, which relies on a single LLM agent interacting with a single reward model (SA-SR). Inspired by recent work showing that collective methods can surpass the performance ceiling of individual models, we introduce Collective Test-Time Scaling (CTTS). First, we systematically investigate three primary interaction paradigms of existing multiple models: single-agent-multi-reward (SA-MR), multi-agent-single-reward (MA-SR), and multi-agent-multi-reward (MA-MR). Extensive experiments reveal that the MA-MR paradigm is consistently superior. Based on this finding, we further propose CTTS-MM, a novel framework that operationalizes multi-agent and multi-reward collaboration. CTTS-MM integrates two key technical contributions: (1) for agent collaboration, an Agent Collaboration Search (ACS) that identifies the most effective combination of LLMs from a candidate pool; and (2) for reward model collaboration, a Mixture of Reward Models (MoR) strategy that leverages a Prior Reward model Ensemble Selection (PRES) algorithm to select the optimal ensemble. Evaluations across seven mainstream benchmarks demonstrate that CTTS-MM significantly outperforms leading STTS methods (+4.82% over Best-of-N) and surpasses even flagship proprietary LLMs (+7.06% over GPT-4.1) and open-source LLMs. These results highlight the substantial potential of collective scaling to push the frontier of LLM inference.

Recent advancements in large language models (LLMs) OpenAI (2025); Yang et al. (2024b); Brown et al. (2020); DeepSeek-AI & et al. (2025); Touvron et al. (2023) have marked a significant milestone in natural language understanding and generation.
LLMs are typically optimized through training-time scaling, which applies ever-larger amounts of data and parameters; this approach faces growing limitations due to its resource-intensive nature and its insatiable demand for human-generated data. To avoid introducing an extra, expensive training stage, test-time scaling (TTS) has emerged as an orthogonal direction for fully eliciting the abilities of pre-trained LLMs during inference.
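The MA-MR paradigm described above can be illustrated with a minimal sketch: several agents each propose a candidate answer, an ensemble of reward models scores every candidate, and the candidate with the highest aggregate reward is returned. The agent and reward-model callables below are toy stand-ins, and mean-reward aggregation is an assumption for illustration; the paper's actual ACS and PRES components are not reproduced here.

```python
# Hedged sketch of multi-agent-multi-reward (MA-MR) selection.
# Agents and reward models are plain Python callables standing in
# for LLMs and learned reward models (an illustrative assumption).

def ma_mr_select(agents, reward_models, prompt):
    """Generate one candidate per agent, score each candidate with
    every reward model, and return the candidate with the highest
    mean reward across the reward-model ensemble."""
    candidates = [agent(prompt) for agent in agents]

    def mean_reward(answer):
        return sum(rm(prompt, answer) for rm in reward_models) / len(reward_models)

    return max(candidates, key=mean_reward)

# Toy stand-ins: two "agents" and two "reward models".
agents = [
    lambda p: p + " -> short answer",
    lambda p: p + " -> detailed answer",
]
reward_models = [
    lambda p, a: len(a),                 # toy reward: prefers longer answers
    lambda p, a: a.count("detailed"),    # toy reward: prefers "detailed" answers
]

best = ma_mr_select(agents, reward_models, "2+2?")
print(best)  # the candidate favored by the averaged rewards
```

The key contrast with the SA-SR baseline is that both the candidate pool (rows) and the scoring signal (columns) are ensembles, so the selection is robust to any single weak agent or miscalibrated reward model.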
