Enhancing Long Chain-of-Thought Reasoning through Multi-Path Plan Aggregation

Xiong, Siheng, Payani, Ali, Fekri, Faramarz

arXiv.org Artificial Intelligence 

We repurpose Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision using small LMs, which yields more efficient training, improved stability, and higher accuracy. OpenAI's o1 series (OpenAI, 2024) introduced inference-time scaling by increasing the length of the Chain-of-Thought (CoT) (Wei et al., 2022) reasoning process. Despite their empirical success, RL approaches that generate the entire reasoning chain in a single forward pass face notable limitations, including CoT derailment, where the reasoning trajectory drifts off course due to accumulated errors, and the inherent challenges of long-horizon RL with sparse outcome rewards. This sequential scaling strategy, i.e., simply extending the CoT length, can therefore be insufficient (Yang et al., 2025). To improve planning quality, we introduce Multi-Path Plan Aggregation (MPPA). At each planning step, the model generates multiple alternative plans and aggregates them into an improved plan before proceeding to the subsequent execution steps. Beyond enhancing planning, we identify a fundamental challenge in credit assignment for long-horizon policy learning (Kaelbling et al., 1996). Existing RL fine-tuning frameworks struggle to provide effective process-level supervision (Guo et al., 2025). First, evaluating the correctness of intermediate steps is inherently difficult, and automated annotation using LLM judges (Gu et al., 2024) often yields unreliable or noisy signals. Second, introducing a separate process reward model (PRM) adds complexity. We instead define the process preference between two candidate continuations at the same step by comparing their incremental log-weights, repurposing Twisted Sequential Monte Carlo (TSMC) to provide process-level preferences for online Step-DPO training. Results show that our approach consistently outperforms both distillation-based long-CoT methods and RL methods that rely solely on outcome rewards.
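The MPPA step described above (sample several candidate plans, then merge them before execution) can be sketched as follows. The `propose` and `aggregate` helpers are hypothetical stand-ins for LLM calls, not the paper's implementation; a real system would prompt a model for both roles.

```python
import random
from typing import Callable, List

def mppa_step(
    context: str,
    propose: Callable[[str], str],
    aggregate: Callable[[str, List[str]], str],
    num_paths: int = 4,
) -> str:
    """One Multi-Path Plan Aggregation step: sample several alternative
    plans for the current context, then aggregate them into a single
    improved plan before the subsequent execution steps."""
    candidates = [propose(context) for _ in range(num_paths)]
    return aggregate(context, candidates)

# Toy stand-ins for LM calls (hypothetical, for illustration only).
def toy_propose(context: str) -> str:
    return f"plan:{random.randint(0, 2)}"

def toy_aggregate(context: str, plans: List[str]) -> str:
    # Simple aggregation rule for the sketch: majority vote over candidates.
    return max(set(plans), key=plans.count)

random.seed(0)
plan = mppa_step("solve x^2 = 4", toy_propose, toy_aggregate)
```

In practice the aggregation would itself be an LLM call that reads all candidate plans and writes a merged plan, rather than a vote.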
Chain-of-Thought trajectories can be lengthy, and the position of the first error varies considerably across trajectories, which makes outcome-based RL fine-tuning on long trajectories highly inefficient.
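The process preference defined in the abstract compares the incremental log-weights of two candidate continuations at the same step. A minimal sketch, using the standard SMC incremental-weight form (policy log-probability plus the change in the twist value); the twist function itself is a placeholder here, not the paper's:

```python
from typing import Tuple

def incremental_log_weight(
    logp_policy: float, log_twist_new: float, log_twist_old: float
) -> float:
    """Incremental log-weight of extending a partial trajectory by one step:
    the policy log-prob of the new step plus the change in the twist
    (intermediate value estimate). Standard TSMC form; the exact twist
    used in the paper is not reproduced here."""
    return logp_policy + log_twist_new - log_twist_old

def process_preference(w_a: float, w_b: float) -> Tuple[str, str]:
    """Return (chosen, rejected) for two candidate continuations sampled
    from the same prefix, preferring the larger incremental log-weight.
    Such pairs can then feed online Step-DPO training."""
    return ("a", "b") if w_a >= w_b else ("b", "a")

# Two candidate continuations from the same prefix (toy numbers).
w_a = incremental_log_weight(logp_policy=-1.2, log_twist_new=-0.3, log_twist_old=-0.5)
w_b = incremental_log_weight(logp_policy=-2.0, log_twist_new=-0.4, log_twist_old=-0.5)
chosen, rejected = process_preference(w_a, w_b)
```

Because both continuations share the same prefix, `log_twist_old` cancels in the comparison, so the preference depends only on each step's policy log-prob and its new twist value.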