Evaluating language models as risk scores
André F. Cruz, Max Planck Institute for Intelligent Systems, Tübingen
Neural Information Processing Systems
Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate LLMs' ability to quantify ground-truth outcome uncertainty. In this work, we focus on the use of LLMs as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using LLMs, and evaluate them against US Census data products.
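The contrast the abstract draws between accuracy and risk scores can be illustrated with a small sketch. It is not the folktexts API: the function names, the two-token ("Yes"/"No") normalization, and the toy numbers are all assumptions made for illustration. It shows two hypothetical scorers that agree on every thresholded prediction yet differ in how well their scores track outcome frequencies.

```python
# Illustrative sketch only: hypothetical helpers, not part of folktexts.

def risk_score(p_yes: float, p_no: float) -> float:
    """Renormalize the probability mass on two answer tokens into a score in [0, 1]."""
    total = p_yes + p_no
    if total <= 0:
        raise ValueError("answer tokens carry no probability mass")
    return p_yes / total


def accuracy(scores, labels, threshold=0.5):
    """Fraction of examples whose thresholded score matches the binary label."""
    hits = sum(int(s >= threshold) == y for s, y in zip(scores, labels))
    return hits / len(labels)


def brier_score(scores, labels):
    """Mean squared difference between score and outcome (lower is better)."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


def expected_calibration_error(scores, labels, n_bins=10):
    """Weighted average gap between mean score and outcome rate per equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        outcome_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(labels) * abs(mean_score - outcome_rate)
    return ece


if __name__ == "__main__":
    # Toy data (made up): outcomes are noisy, so no scorer can be perfectly accurate.
    labels = [1, 1, 1, 0, 0, 0, 0, 1]
    overconfident = [0.99, 0.99, 0.99, 0.01, 0.01, 0.01, 0.99, 0.01]
    moderate = [0.75, 0.75, 0.75, 0.25, 0.25, 0.25, 0.75, 0.25]
    for name, scores in [("overconfident", overconfident), ("moderate", moderate)]:
        print(name, accuracy(scores, labels), brier_score(scores, labels),
              expected_calibration_error(scores, labels))
```

Both toy scorers make the same thresholded predictions, so an accuracy-only benchmark cannot tell them apart, whereas the Brier score and the calibration error can.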
Mar-26-2025, 23:47:44 GMT
- Country:
- Europe > Germany
- Baden-Württemberg > Tübingen Region > Tübingen (0.40)
- North America > United States (1.00)
- Genre:
- Questionnaire & Opinion Survey (0.68)
- Research Report > New Finding (0.92)
- Industry:
- Education (0.70)
- Government (0.92)
- Technology: