llm eval
YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
D'Souza, Jennifer; Giglou, Hamed Babaei; Münch, Quentin
Large Language Models (LLMs) drive scientific question-answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry.
- Europe (1.00)
- Asia (0.67)
- North America > United States > Minnesota (0.27)
- Leisure & Entertainment > Sports (0.93)
- Health & Medicine > Therapeutic Area (0.67)
- Education > Curriculum > Subject-Specific Education (0.45)
Position: Don't use the CLT in LLM evals with fewer than a few hundred datapoints
Bowyer, Sam; Aitchison, Laurence; Ivanova, Desi R.
Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e., producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios. We provide a simple Python library for these Bayesian methods at https://github.com/sambowyer/bayes_evals.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > Greenland (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- Europe > United Kingdom > England > Bristol (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
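
The small-sample problem described in the position paper's abstract can be illustrated with a minimal sketch. It compares a CLT-based (normal approximation) interval for benchmark accuracy with two common alternatives: the Wilson score interval and a Beta posterior credible interval under a uniform prior. These particular alternatives, the function names, and the Monte Carlo settings are assumptions chosen for illustration; this is not the API of the bayes_evals library, and the paper's own recommendations may differ in detail.

```python
import math
import random
from statistics import NormalDist


def clt_interval(k, n, level=0.95):
    """Normal-approximation (CLT) interval for accuracy k/n."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(k, n, level=0.95):
    """Wilson score interval, a frequentist alternative for small n."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def beta_credible_interval(k, n, level=0.95, draws=20000, seed=0):
    """Equal-tailed credible interval from a Beta(1 + k, 1 + n - k) posterior.

    Assumes a uniform Beta(1, 1) prior; quantiles are estimated by
    Monte Carlo sampling so that only the standard library is needed.
    """
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(1 + k, 1 + n - k) for _ in range(draws))
    tail = (1 - level) / 2
    lower = samples[int(tail * draws)]
    upper = samples[int((1 - tail) * draws) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical benchmark: 20 questions, all answered correctly.
    k, n = 20, 20
    print("CLT:   ", clt_interval(k, n))
    print("Wilson:", wilson_interval(k, n))
    print("Beta:  ", beta_credible_interval(k, n))
```

With 20 correct answers out of 20, the CLT interval collapses to the single point 1.0 because the estimated variance is zero, whereas the Wilson interval has a lower bound of roughly 0.84 and the Beta posterior also leaves visible uncertainty. This is one concrete way error bars can come out too small on a small benchmark.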