Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning
Wang, Haonan, Du, Chao, Kawaguchi, Kenji, Pang, Tianyu
arXiv.org Artificial Intelligence
Majority voting has proven effective for close-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a "majority" over complete solutions is ill-defined. Empirically, the logit-averaging approach matches or surpasses majority voting on AIME and GPQA while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.

Recent advances in Large Language Models (LLMs) have been driven by test-time compute scaling. As evidenced by OpenAI's o1 (OpenAI, 2024) and DeepSeek-R1 (Guo et al., 2025), among others, models generate extended "think" segments that reflect intermediate hypotheses, derivations, and self-corrections prior to emitting the final answer (Chen et al., 2025b; Yang et al., 2025c). Such sequential test-time scaling has established a new paradigm: increasing inference-time computation (e.g., longer reasoning traces) often leads to improved accuracy and problem-solving capability. Yet simply lengthening the chain has diminishing returns and can even hurt, e.g., through overthinking (Chen et al., 2024; Cuadron et al., 2025), with studies showing that correct answers often appear in shorter traces (Zeng et al., 2025).
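The contrast between voting over complete answers and averaging logits can be illustrated with a small toy sketch. This excerpt does not describe how the paper merges logits during decoding, so the code below is only an assumption-laden illustration of the two ideas named in the title and abstract. The function names (majority_vote, average_logits, softmax) and the toy logit values are hypothetical, and the sketch assumes each parallel trace supplies one logit vector for the same next-token position over a shared vocabulary.

```python
from collections import Counter
import math


def majority_vote(answers):
    """Return the most common final answer among parallel traces."""
    return Counter(answers).most_common(1)[0][0]


def average_logits(per_trace_logits):
    """Element-wise mean of next-token logits, one logit vector per trace."""
    n = len(per_trace_logits)
    return [sum(column) / n for column in zip(*per_trace_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Close-ended case: three traces end in short answers that can be compared directly.
voted = majority_vote(["42", "41", "42"])

# Open-ended case: complete outputs rarely coincide, so this toy example
# averages next-token logits (made-up values, 3-token vocabulary) instead.
trace_logits = [
    [2.0, 0.5, -1.0],
    [1.0, 1.5, -0.5],
    [3.0, 0.0, -2.0],
]
merged = average_logits(trace_logits)
probs = softmax(merged)

# Greedy choice of the next token from the merged distribution.
next_token = max(range(len(merged)), key=merged.__getitem__)
```

Voting needs answers that can be compared for equality, whereas averaging operates on per-token scores and yields a single output sequence, which is why it remains well-defined for code or long-form text.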
Dec-3-2025
- Country:
  - Asia > Middle East > Jordan (0.04)
  - Asia > Singapore (0.04)
  - Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Genre:
  - Research Report > New Finding (0.34)
- Technology: