AITopics | mmlu-pro

Collaborating Authors

mmlu-pro

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Neural Information Processing SystemsMar-22-2026, 01:04:20 GMT

In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates part of the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16\% to 33\% compared to MMLU, but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5\% in MMLU to just 2\% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is more discriminative benchmark to better track progress in the field.

artificial intelligence, mmlu-pro, natural language, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.61)

Add feedback

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Neural Information Processing SystemsFeb-17-2026, 09:48:02 GMT

However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities.

arxiv preprint arxiv, large language model, machine learning, (15 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > France (0.04)
Europe > Spain > Galicia > Madrid (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

TeamMedAgents: Enhancing Medical Decision-Making of LLMs Through Structured Teamwork

Mishra, Pranav Pushkar, Arvan, Mohammad, Zalake, Mohan

arXiv.org Artificial IntelligenceDec-4-2025

Building upon Salas et al.'s "Big Five" teamwork model, we operationalize five core components as independently configurable mechanisms: shared mental models, team leadership, team orientation, trust networks, and mutual monitoring. Our architecture dynamically recruits 2-4 specialist agents and employs structured four-phase deliberation with adaptive component selection. Evaluation across eight medical benchmarks encompassing 11,545 questions demonstrates TeamMedAgents achieves 77.63% overall accuracy (text-based: 81.30%, vision-language: 66.60%). Systematic ablation studies comparing three single-agent baselines (Zero-Shot, Few-Shot, CoT) against individual teamwork components reveal task-specific optimization patterns: shared mental models excel on knowledge tasks, trust mechanisms improve differential diagnosis, while comprehensive integration degrades performance. Adaptive component selection yields 2-10 percentage point improvements over strongest baselines, with 96.2% agent convergence validating structured coordination effectiveness. TeamMedAgents establishes principled methodology for translating human teamwork theory into multi-agent systems, demonstrating that evidence-based collaboration patterns enhance AI performance in safety-critical domains through modular component design and selective activation strategies.

artificial intelligence, arxiv preprint arxiv, natural language, (15 more...)

arXiv.org Artificial Intelligence

2508.08115

Country: North America > United States > Illinois (0.14)

Genre: Research Report (0.50)

Industry: Health & Medicine > Diagnostic Medicine (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

Don't Miss the Forest for the Trees: In-Depth Confidence Estimation for LLMs via Reasoning over the Answer Space

Wang, Ante, Ma, Weizhi, Liu, Yang

arXiv.org Artificial IntelligenceNov-19-2025

Knowing the reliability of a model's response is essential in application. With the strong generation capabilities of LLMs, research has focused on generating verbalized confidence. This is further enhanced by combining chain-of-thought reasoning, which provides logical and transparent estimation. However, how reasoning strategies affect the estimated confidence is still under-explored. In this work, we demonstrate that predicting a verbalized probability distribution can effectively encourage in-depth reasoning for confidence estimation. Intuitively, it requires an LLM to consider all candidates within the answer space instead of basing on a single guess, and to carefully assign confidence scores to meet the requirements of a distribution. This method shows an advantage across different models and various tasks, regardless of whether the answer space is known. Its advantage is maintained even after reinforcement learning, and further analysis shows its reasoning patterns are aligned with human expectations.

confidence score, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2511.14275

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Improving Metacognition and Uncertainty Communication in Language Models

Steyvers, Mark, Belem, Catarina, Smyth, Padhraic

arXiv.org Artificial IntelligenceOct-23-2025

Large language models (LLMs) are increasingly used in decision-making contexts, but when they present answers without signaling low confidence, users may unknowingly act on erroneous outputs. Prior work shows that LLMs maintain internal uncertainty signals, yet their expressed confidence is often miscalibrated and poorly discriminates between correct and incorrect answers. We investigate whether supervised fine-tuning can improve models' ability to communicate uncertainty and whether such improvements generalize across tasks and domains. We fine-tune LLMs on datasets spanning general knowledge, mathematics, and open-ended trivia, and evaluate two metacognitive tasks: (1) single-question confidence estimation, where the model assigns a numeric certainty to its answer, and (2) pairwise confidence comparison, where the model selects which of two answers it is more likely to answer correctly. We assess generalization to unseen domains, including medical and legal reasoning. Results show that fine-tuning improves calibration (alignment between stated confidence and accuracy) and discrimination (higher confidence for correct vs. incorrect responses) within and across domains. However, gains are task-specific: training on single-question calibration does not transfer to pairwise comparison, and vice versa. Multitask fine-tuning yields broader gains, lowering calibration error and strengthening discrimination in out-of-domain evaluations. This suggests that uncertainty communication in LLMs is trainable but requires multitask training to generalize effectively.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2510.05126

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives

Lim, Henry, Lim, Kwan Hui

arXiv.org Artificial IntelligenceOct-21-2025

Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks, by systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical under four paradigms, namely: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45\% for Roman vs. numeric), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84\%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fail random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2510.17388

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

How Reliable is Language Model Micro-Benchmarking?

Yauney, Gregory, Warraich, Shahzaib Saqib, Swayamdipta, Swabha

arXiv.org Artificial IntelligenceOct-13-2025

Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark. This approach can determine which model pairs can be ranked correctly by a micro-benchmark, allowing for a finer-grained analysis of the trade-off between micro-benchmark size and reliability. Prior work has suggested selecting as few as 10 examples; we find that no micro-benchmarking method can consistently rank model pairs 3.5 points of accuracy apart on MMLU-Pro or 4 points apart on BIG-bench Hard. In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods. When comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, we find that more than half of pairwise comparisons are not likely to be preserved. Our work provides actionable guidance for both micro-benchmark users and developers in navigating the trade-off between evaluation efficiency and reliability.

benchmark, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.0873

Country: North America > United States > California > Los Angeles County > Los Angeles (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

ad236edc564f3e3156e1b2feafb99a24-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsOct-10-2025, 13:03:24 GMT

arxiv preprint arxiv, benchmark, total weight contribution, (10 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > France (0.04)
Europe > Spain > Galicia > Madrid (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

A-VERT: Agnostic Verification with Embedding Ranking Targets

Aguirre, Nicolás, Caso, Ramiro, Colmeiro, Ramiro Rodríguez, Santelli, Mauro, Calderón, Joaquín Toranzo

arXiv.org Artificial IntelligenceOct-3-2025

The automatic evaluation of Language Model (LM) responses is a critical piece in the development of benchmarks and metrics, both for model training and quality assessment of production model endpoints. The current approaches to response classification relies on methods that are too expensive (i.e. LLM-as-a-Judge) or that are far from real-world conditions (string-matching, logprob). In this paper, a structure-free evaluation method is presented. The method makes use of semantic embedding distances to match target candidates with arbitrary LM-generated text, resulting in a robust classification of the response at a relatively low compute cost (embedding models of less than $10B$ parameters). The results show a regression score of ~0.97 and an accuracy of ~96% against human annotators, tested over 3 data sets and 3 different LM architectures.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2510.01469

Country: North America > United States (0.68)

Genre: Research Report > New Finding (0.66)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Rethinking Reward Models for Multi-Domain Test-Time Scaling

Lee, Dong Bok, Lee, Seanie, Park, Sangwoo, Kang, Minki, Baek, Jinheon, Kim, Dongki, Wagner, Dominik, Jin, Jiongdao, Lee, Heejun, Bocklet, Tobias, Wang, Jinyu, Fu, Jingjing, Hwang, Sung Ju, Bian, Jiang, Song, Lei

arXiv.org Artificial IntelligenceOct-3-2025

The reliability of large language models (LLMs) during test-time scaling is often assessed with \emph{external verifiers} or \emph{reward models} that distinguish correct reasoning from flawed logic. Prior work generally assumes that process reward models (PRMs), which score every intermediate reasoning step, outperform outcome reward models (ORMs) that assess only the final answer. This view is based mainly on evidence from narrow, math-adjacent domains. We present the first unified evaluation of four reward model variants, discriminative ORM and PRM (\DisORM, \DisPRM) and generative ORM and PRM (\GenORM, \GenPRM), across 14 diverse domains. Contrary to conventional wisdom, we find that (i) \DisORM performs on par with \DisPRM, (ii) \GenPRM is not competitive, and (iii) overall, \GenORM is the most robust, yielding significant and consistent gains across every tested domain. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories, including those involving self-correcting reasoning. Our theoretical analysis shows that step-wise aggregation compounds errors as reasoning length grows, and our empirical observations confirm this effect. These findings challenge the prevailing assumption that fine-grained supervision is always better and support generative outcome verification for multi-domain deployment. We publicly release our code, datasets, and checkpoints at \href{https://github.com/db-Lee/Multi-RM}{\underline{\small\texttt{https://github.com/db-Lee/Multi-RM}}} to facilitate future research in multi-domain settings.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2510.00492

Country:

Asia (0.93)
Europe > Austria (0.28)
North America > Mexico (0.28)

Genre: Research Report (0.81)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback