AITopics | benchmark suite


Towards Realistic Earth Observation Constellation Scheduling Benchmark and Methodology

Neural Information Processing Systems | Jun-18-2026, 20:54:08 GMT

Agile Earth Observation Satellites (AEOSs) constellations offer unprecedented flexibility for monitoring the Earth's surface, but their scheduling remains challenging under large-scale scenarios, dynamic environments, and stringent constraints. Existing methods often simplify these complexities, limiting their real-world performance. We address this gap with a unified framework integrating a standardized benchmark suite and a novel scheduling model. Our benchmark suite, AEOS-Bench, contains 3,907 finely tuned satellite assets and 16,410 scenarios.

artificial intelligence, guideline, justification, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.94)


Towards Realistic Earth-Observation Constellation Scheduling: Benchmark and Methodology

Neural Information Processing Systems | Jun-12-2026, 22:47:08 GMT

Agile Earth Observation Satellites (AEOSs) constellations offer unprecedented flexibility for monitoring the Earth's surface, but their scheduling remains challenging under large-scale scenarios, dynamic environments, and stringent constraints. Existing methods often simplify these complexities, limiting their real-world performance. We address this gap with a unified framework integrating a standardized benchmark suite and a novel scheduling model. Our benchmark suite, AEOS-Bench, contains 3,907 finely tuned satellite assets and 16,410 scenarios.

artificial intelligence, proceedings, scenario, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.39)


An Interpretable and Scalable Framework for Evaluating Large Language Models

Qu, Xinhao, Heng, Qiang, Zeng, Hao, Liu, Xiaoqian

arXiv.org Machine Learning | May-11-2026

Evaluation of large language models (LLMs) is increasingly critical, yet standard benchmarking methods rely on average accuracy, overlooking both the inherent stochasticity of LLM outputs and the heterogeneity of benchmark items. Item Response Theory (IRT) offers a principled framework for modeling latent model abilities and item characteristics, but conventional methods are computationally expensive and numerically unstable, limiting large-scale implementations. To address these challenges, we propose an interpretable and scalable framework for LLM evaluation based on the majorization-minimization principle. Our approach reformulates the problem as a sequence of constrained matrix factorization subproblems, enabling stable and efficient parameter estimation with theoretical guarantees for identifiability and convergence. Experiments on synthetic and real-world datasets, including MATH-500 and six Open LLM Leaderboard benchmarks, demonstrate that our method achieves superior scalability and interpretability. It delivers orders-of-magnitude speedups over competing methods while maintaining comparable or even higher estimation accuracy. Our results align with established scaling laws and offer insights into item difficulty and discrimination, informing more principled benchmark design.
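As a rough illustration of the IRT idea mentioned in the abstract, the sketch below evaluates the standard two-parameter logistic (2PL) item response function. It is not the authors' majorization-minimization method; the item parameters, ability values, and names such as `p_correct`, `model_small`, and `model_large` are hypothetical and chosen only for illustration.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function.

    theta: latent ability of the model being evaluated
    a:     item discrimination (how sharply the item separates abilities)
    b:     item difficulty (ability at which success probability is 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical benchmark items: name -> (discrimination, difficulty)
items = {"easy": (1.0, -1.0), "hard": (2.0, 1.5)}

# Hypothetical latent abilities for two models
abilities = {"model_small": 0.0, "model_large": 2.0}

# Predicted success probability of each model on each item
table = {
    name: {item: p_correct(theta, a, b) for item, (a, b) in items.items()}
    for name, theta in abilities.items()
}

for name, row in table.items():
    print(name, {item: round(p, 3) for item, p in row.items()})
```

Under these assumed parameters, the gap between the two models is much larger on the "hard" item than on the "easy" one, which is the kind of item heterogeneity that a plain average accuracy does not expose.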

arxiv preprint arxiv, large language model, machine learning, (17 more...)

arXiv.org Machine Learning

2605.07046

Country:

North America > Mexico (0.28)
Europe > Austria (0.28)

Genre:

Overview (0.92)
Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)


UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-World Document Analysis

Neural Information Processing Systems | Mar-21-2026, 06:15:41 GMT

The use of Retrieval-Augmented Generation (RAG) has improved Large Language Models (LLMs) in collaborating with external data, yet significant challenges exist in real-world scenarios. In areas such as academic literature and finance question answering, data are often found in raw text and tables in HTML or PDF formats, which can be lengthy and highly unstructured. In this paper, we introduce a benchmark suite, namely Unstructured Document Analysis (UDA), that involves 2,965 real-world documents and 29,590 expert-annotated Q&A pairs. We revisit popular LLM- and RAG-based solutions for document analysis and evaluate the design choices and answer qualities across multiple document domains and diverse query types. Our evaluation yields interesting findings and highlights the importance of data parsing and retrieval. We hope our benchmark can shed light and better serve real-world document analysis applications.

large language model, machine learning, natural language, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)


Statistical Multicriteria Benchmarking via the GSD-Front

Neural Information Processing Systems | Feb-17-2026, 13:15:15 GMT

For (3), we relax our proposed test using techniques from robust statistics and imprecise probabilities.

artificial intelligence, classifier, machine learning, (18 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Saxony > Leipzig (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > San Mateo County > Menlo Park (0.04)
(4 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Government (0.45)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)


UDA

Neural Information Processing Systems | Feb-16-2026, 01:34:04 GMT

Cleaning missing values: The human-generated questions may be unanswerable. Thus, we remove the Q&A items that lack available answers. Additionally, documents lacking any valid Q&A pairs are also removed.
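The two cleaning rules in this excerpt can be sketched roughly as follows. The record layout (`id`, `qa_pairs`, `question`, `answer`) and the function names are assumptions made for illustration; the actual UDA data schema is not shown in this excerpt.

```python
def has_answer(qa: dict) -> bool:
    """Treat a Q&A item as answerable only if it has a non-empty string answer."""
    answer = qa.get("answer")
    return isinstance(answer, str) and answer.strip() != ""


def clean_documents(documents: list) -> list:
    """Drop unanswerable Q&A items, then drop documents left with no valid items."""
    cleaned = []
    for doc in documents:
        valid = [qa for qa in doc.get("qa_pairs", []) if has_answer(qa)]
        if valid:
            # Copy the document so the input list is not modified in place.
            cleaned.append({**doc, "qa_pairs": valid})
    return cleaned


# Hypothetical input records
docs = [
    {
        "id": "doc-1",
        "qa_pairs": [
            {"question": "Q1", "answer": "42"},
            {"question": "Q2", "answer": ""},
        ],
    },
    {"id": "doc-2", "qa_pairs": [{"question": "Q3", "answer": None}]},
]

cleaned = clean_documents(docs)
print([d["id"] for d in cleaned])
```

In this toy input, `doc-1` keeps only its answerable item and `doc-2` is removed because none of its items has an answer.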

artificial intelligence, dataset, natural language, (17 more...)

Neural Information Processing Systems

Country:

North America > United States (0.30)
Asia > China > Shanghai > Shanghai (0.06)
Asia > Singapore (0.05)

Industry: Law (0.96)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)


Compiler Auto-Vectorization with Imitation Learning

Charith Mendis, Cambridge Yang, Yewen Pu, Dr. Saman Amarasinghe, Michael Carbin

Neural Information Processing Systems | Feb-14-2026, 07:30:56 GMT

Therefore, many existing SLP vectorization schemes are driven by hand-crafted heuristics (Liu et al., 2012;

artificial intelligence, machine learning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.06)
North America > United States > District of Columbia > Washington (0.05)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


525bd8aedafa375564f73bacdef411e5-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Feb-13-2026, 07:23:28 GMT

Mutual Information (MI) is a fundamental metric for quantifying dependency between two random variables.
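For readers unfamiliar with the quantity, the sketch below computes the textbook discrete mutual information, I(X;Y) = sum over (x, y) of p(x,y) log( p(x,y) / (p(x) p(y)) ), from a joint probability table. It illustrates the definition only, not the estimators or benchmark discussed in the paper; the function name and the two toy distributions are hypothetical.

```python
import math


def mutual_information(joint: list) -> float:
    """Mutual information (in nats) of a discrete joint distribution.

    joint[i][j] is p(X = i, Y = j); entries are assumed to sum to 1.
    """
    px = [sum(row) for row in joint]        # marginal of X
    py = [sum(col) for col in zip(*joint)]  # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:  # zero-probability cells contribute nothing
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi


# Two fair binary variables that are independent
independent = [[0.25, 0.25], [0.25, 0.25]]

# Two fair binary variables that always take the same value
dependent = [[0.5, 0.0], [0.0, 0.5]]

print(mutual_information(independent))
print(mutual_information(dependent))
```

The independent table gives zero mutual information, while the fully dependent table gives log 2 nats (one bit), the entropy of a fair binary variable.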

artificial intelligence, dataset, machine learning, (16 more...)

Neural Information Processing Systems

Country:

Asia > South Korea > Seoul > Seoul (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)


0668e20b3c9e9185b04b3d2a9dc8fa2d-AuthorFeedback.pdf

Neural Information Processing Systems | Feb-11-2026, 09:07:37 GMT

benchmark, optimization, reviewer, (14 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.36)


Beyond Arrow: From Impossibility to Possibilities in Multi-Criteria Benchmarking

Gordienko, Polina, Jansen, Christoph, Rodemann, Julian, Schollmeyer, Georg

arXiv.org Machine Learning | Feb-10-2026

Modern benchmarks such as HELM MMLU account for multiple metrics like accuracy, robustness and efficiency. When trying to turn these metrics into a single ranking, natural aggregation procedures can become incoherent or unstable to changes in the model set. We formalize this aggregation as a social choice problem where each metric induces a preference ranking over models on each dataset, and a benchmark operator aggregates these votes across metrics. While prior work has focused on Arrow's impossibility result, we argue that the impossibility often originates from pathological examples and identify sufficient conditions under which these disappear, and meaningful multi-criteria benchmarking becomes possible. In particular, we deal with three restrictions on the combinations of rankings and prove that on single-peaked, group-separable and distance-restricted preferences, the benchmark operator allows for the construction of well-behaved rankings of the involved models. Empirically, we investigate several modern benchmark suites like HELM MMLU and verify which structural conditions are fulfilled on which benchmark problems.
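The instability to changes in the model set that the abstract describes can be illustrated with a small sketch. It uses the Borda count as a stand-in aggregation rule, which is an assumption for illustration and not necessarily the benchmark operator studied in the paper; the models `A`, `B`, `C` and the five metric rankings are hypothetical.

```python
def borda(rankings: list) -> list:
    """Aggregate rankings (best model first) with the Borda count.

    A model in position pos of a ranking over n models earns n - 1 - pos points.
    Returns the models sorted by total points, ties broken alphabetically.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, model in enumerate(ranking):
            scores[model] = scores.get(model, 0) + (n - 1 - pos)
    return sorted(scores, key=lambda m: (-scores[m], m))


def drop(rankings: list, model: str) -> list:
    """Remove one model from every ranking, keeping the remaining order."""
    return [[m for m in ranking if m != model] for ranking in rankings]


# Hypothetical rankings induced by five metrics over three models
rankings = [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 2

full = borda(rankings)                # aggregate over A, B, C
reduced = borda(drop(rankings, "C"))  # same metrics, model C removed

print(full)
print(reduced)
```

With all three models the aggregate order is B, A, C (scores 7, 6, 2), but after removing C the order of the remaining two flips to A, B (scores 3, 2), even though no metric changed its relative ordering of A and B.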

large language model, machine learning, ranking, (19 more...)

arXiv.org Machine Learning

2602.07593

Country:

Europe > Germany > Saxony > Leipzig (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)
