Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems
Aakriti Agrawal, Rohith Aralikatti, Anirudh Satheesh, Souradip Chakraborty, Amrit Singh Bedi, Furong Huang
arXiv.org Artificial Intelligence
Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. While multi-LLM systems produce more diverse responses than single models and thus have greater potential, they often underperform compared to single-LLM self-consistency. We propose a principled, novel, and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score, implicitly leveraging the inherent knowledge and confidence of these models. Our method demonstrates improvements of approximately 4%, 3%, and 5% on GSM8K, MMLU (6 subsets), and ARC respectively, in both debate (multi-round LLM discussions) and non-debate (Best-of-N with multiple LLMs) settings.
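The selection idea in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes the "calibrated log-likelihood score" is a length-normalized average of per-token log-probabilities (one common calibration; the paper's exact scheme may differ), and the candidate responses and their token log-probs are hypothetical placeholder data.

```python
import math

def avg_token_logprob(token_logprobs):
    """Length-normalized sum of token log-probabilities.

    Dividing by length is one simple calibration that keeps longer
    responses from being penalized merely for containing more tokens.
    """
    return sum(token_logprobs) / len(token_logprobs)

def select_best_response(candidates):
    """Pick the response whose generating model assigns it the highest
    calibrated (here: length-normalized) log-likelihood.

    `candidates` is a list of (response_text, token_logprobs) pairs,
    one pair per LLM in the ensemble.
    """
    best_text, best_score = None, -math.inf
    for text, logprobs in candidates:
        score = avg_token_logprob(logprobs)
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score

# Toy example: three hypothetical model outputs with made-up log-probs.
candidates = [
    ("The answer is 42.", [-0.2, -0.1, -0.3, -0.4]),  # avg = -0.25
    ("It is 41.",         [-0.9, -1.2, -0.8]),        # avg ~ -0.97
    ("42",                [-0.05]),                   # avg = -0.05
]
text, score = select_best_response(candidates)
print(text)  # → 42
```

The same scoring applies in both settings the abstract mentions: in Best-of-N each LLM contributes one candidate, while in the debate setting the candidates are the final-round responses.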
Oct-6-2025