Intra-request branch orchestration for efficient LLM reasoning

Weifan Jiang, Rana Shahout, Yilun Du, Michael Mitzenmacher, Minlan Yu

arXiv.org Artificial Intelligence 

LLMs increasingly rely on inference-time reasoning algorithms such as chain-of-thought and multi-branch reasoning to improve accuracy on complex tasks. Prior work has primarily focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors. We present DUCHESS, an LLM serving system that reduces computational cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions. Within each request, DUCHESS predicts branch correctness with a lightweight linear probing model over LLM layer activations. The orchestration policy uses these predictions to decide whether to terminate a branch early, duplicate an existing branch, or continue exploring a branch. When handling multiple requests, DUCHESS can further reduce latency by prioritizing easier reasoning tasks when request complexity can be estimated from the prompt. Experiments on three reasoning benchmarks show that DUCHESS consistently improves the token-accuracy Pareto frontier, reducing token usage by 42-63% at matched accuracy compared to self-consistency. For request serving with vLLM, DUCHESS reduces mean, median, and tail latencies by 57-81%, 58-85%, and 52-84% with First-Come-First-Served (FCFS) scheduling across three datasets, compared to self-consistency. At higher request rates, scheduling jobs by increasing predicted difficulty reduces latency further over FCFS.

Large Language Models (LLMs) are widely applied across domains, including math and science problem-solving (Lewkowycz et al., 2022), coding and program analysis (Jiang et al., 2025), logical deduction (Creswell et al., 2023), and decision-making (Yao et al., 2023).
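To make the mechanism concrete, the following is a minimal sketch, not the paper's implementation, of the two ingredients the abstract describes: a lightweight linear probe (here, logistic regression trained by gradient descent) that maps a branch's layer activations to a correctness probability, and a threshold-based orchestration policy that terminates, duplicates, or continues a branch. The activation vectors, labels, dimensions, and thresholds below are all synthetic assumptions for illustration; in DUCHESS the features would come from actual LLM hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16  # assumed activation dimensionality (real LLM layers are much wider)

# Synthetic stand-in for LLM layer activations: one hidden direction
# (unknown to the probe) determines whether a branch is "correct".
w_true = rng.normal(size=dim)

def make_branch_batch(n):
    X = rng.normal(size=(n, dim))          # activations, one row per branch
    y = (X @ w_true > 0).astype(float)     # synthetic correctness labels
    return X, y

# Train the linear probe: logistic regression via plain gradient descent.
X_train, y_train = make_branch_batch(2000)
w, b, lr = np.zeros(dim), 0.0, 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X_train @ w + b)))
    w -= lr * (X_train.T @ (p - y_train)) / len(y_train)
    b -= lr * np.mean(p - y_train)

def predict_correctness(h):
    """Probe output: estimated probability that a branch is correct."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

def orchestrate(h, low=0.2, high=0.8):
    """Toy per-branch policy with assumed thresholds: kill unpromising
    branches early, duplicate promising ones, otherwise keep exploring."""
    p = predict_correctness(h)
    if p < low:
        return "terminate"
    if p > high:
        return "duplicate"
    return "continue"

# Held-out check that the probe learned the correctness direction.
X_test, y_test = make_branch_batch(500)
acc = np.mean((predict_correctness(X_test) > 0.5) == y_test)
print(f"probe held-out accuracy: {acc:.2f}")
```

The real system would run this probe once per branch at decode time, so its cost (a single dot product per branch) is negligible next to the LLM forward pass; that is what makes activation-based orchestration practical inside a serving loop.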