ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning

Jun-12-2026, 02:45:07 GMT – Neural Information Processing Systems

Evaluating large language models (LLMs) is difficult, in particular because of data contamination and the leakage of correct answers. To address these problems, we introduce ThinkBench, an evaluation framework designed to robustly assess the reasoning capability of LLMs. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and provides an OOD dataset of 2,912 samples drawn from reasoning tasks.

artificial intelligence, large language model, natural language

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.99)