ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
Lior, Gili; Habba, Eliya; Levy, Shahar; Caciularu, Avi; Stanovsky, Gabriel
arXiv.org Artificial Intelligence
LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method-of-moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and propose ReliableEval, a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
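The idea of estimating how many prompt resamplings are needed can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a model's score is treated as a random variable over paraphrased prompts, that its first two moments are estimated from a small pilot sample, and that a normal approximation is used to pick the sample size. The names `estimate_moments`, `resamplings_needed`, and `toy_score`, the tolerance `epsilon`, and the Beta-distributed toy scores are all hypothetical.

```python
import math
import random
import statistics


def estimate_moments(scores):
    """Return the sample mean and unbiased sample variance of the scores."""
    mean = statistics.fmean(scores)
    var = statistics.variance(scores) if len(scores) > 1 else 0.0
    return mean, var


def resamplings_needed(variance, epsilon, z=1.96):
    """Smallest n such that z * sqrt(variance / n) <= epsilon.

    Assumption: the mean score over n resampled prompts is roughly normal,
    so z = 1.96 corresponds to an approximate 95% confidence interval.
    """
    if variance <= 0:
        return 1
    return max(1, math.ceil(z * z * variance / (epsilon * epsilon)))


def toy_score(rng):
    """Hypothetical stand-in for scoring a model on one paraphrased prompt."""
    return rng.betavariate(8, 2)


rng = random.Random(0)
pilot_scores = [toy_score(rng) for _ in range(20)]  # small pilot sample
mean, var = estimate_moments(pilot_scores)
n = resamplings_needed(var, epsilon=0.02)
print(f"pilot mean={mean:.3f}, variance={var:.4f}, resamplings needed={n}")
```

In practice `toy_score` would be replaced by a real evaluation of the model on one meaning-preserving paraphrase of the benchmark prompt. A tighter `epsilon` or a more prompt-sensitive model (larger variance) raises the required number of resamplings.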
Sep-16-2025