Estimating the Self-Consistency of LLMs
arXiv.org Artificial Intelligence
Systems often repeat the same prompt to large language models (LLMs) and aggregate responses to improve reliability. Common approaches include self-consistency or simple majority voting (sample multiple outputs and choose the mode), prompt ensembling (rephrasing prompts to reduce wording sensitivity), and multi-agent debate (running multiple instances and aggregating their conclusions). Such consensus methods can stabilize outputs and improve accuracy, especially on multi-step reasoning tasks [1]. This short note analyzes an estimator of the self-consistency of LLMs and the tradeoffs it induces under a fixed compute budget B = mn, where m is the number of prompts sampled from the task distribution and n is the number of repeated LLM calls per prompt; the resulting analysis favors a rough split m ≈ n ≈ √B. It complements recent work on self-consistency prompting that aggregates multiple sampled reasoning paths to stabilize predictions [2, 3]. Consider a prompt x that requires a binary response.
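The budget-split idea above can be sketched in code. The snippet below is a minimal illustration, not the paper's method: the `call_llm` interface and the Bernoulli simulation are assumptions introduced here. For a prompt with binary responses, the per-prompt agreement probability P(two independent calls agree) = p² + (1−p)² admits the standard unbiased estimator [k(k−1) + (n−k)(n−k−1)] / [n(n−1)] from k positive outcomes among n calls; the sketch averages this over m prompts, with the budget B = mn split roughly as m ≈ n ≈ √B.

```python
import math
import random


def agreement_estimate(samples):
    """Unbiased estimate of P(two independent calls agree) from n binary samples.

    Estimates p^2 + (1-p)^2 via k(k-1)/(n(n-1)) + (n-k)(n-k-1)/(n(n-1)),
    which is unbiased because it counts agreeing pairs among the n samples.
    """
    n = len(samples)
    k = sum(samples)
    return (k * (k - 1) + (n - k) * (n - k - 1)) / (n * (n - 1))


def estimate_self_consistency(call_llm, prompts, budget):
    """Average per-prompt agreement over m prompts with n calls each.

    Splits the budget B = m * n roughly as m = n = sqrt(B), per the note's
    analysis. `call_llm(prompt)` is a hypothetical interface returning 0 or 1.
    """
    n = max(2, math.isqrt(budget))  # need n >= 2 for the pairwise estimator
    m = max(1, budget // n)
    chosen = random.choices(prompts, k=m)  # sample m prompts from the task set
    per_prompt = [
        agreement_estimate([call_llm(x) for _ in range(n)]) for x in chosen
    ]
    return sum(per_prompt) / m
```

As a sanity check, a perfectly deterministic "model" (always answering 1) yields an estimated self-consistency of exactly 1.0, while a fair-coin responder gives values near 0.5.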
Sep-25-2025