AITopics

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing SystemsFeb-9-2026, 19:31:56 GMT

Efficientmulti-promptevaluationofLLMs

large language model, machine learning, natural language, (21 more...)

Country: Europe > Denmark > Capital Region > Copenhagen (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Neural Information Processing SystemsOct-9-2025, 21:30:57 GMT

Efficient multi-prompt evaluation of LLMs Felipe Maia Polo

Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs' abilities and can affect the

arxiv preprint arxiv, evaluation, prompt template, (14 more...)

Country:

South America > Brazil > Minas Gerais (0.04)
North America > United States > Michigan (0.04)
Europe > Denmark > Capital Region > Copenhagen (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry:

Information Technology (0.67)
Education > Assessment & Standards (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

arXiv.org Artificial IntelligenceSep-29-2025

The Lie of the Average: How Class Incremental Learning Evaluation Deceives You?

Lai, Guannan, Zhou, Da-Wei, Yang, Xin, Ye, Han-Jia

Class Incremental Learning (CIL) requires models to continuously learn new classes without forgetting previously learned ones, while maintaining stable performance across all possible class sequences. In real-world settings, the order in which classes arrive is diverse and unpredictable, and model performance can vary substantially across different sequences. Yet mainstream evaluation protocols calculate mean and variance from only a small set of randomly sampled sequences. Our theoretical analysis and empirical results demonstrate that this sampling strategy fails to capture the full performance range, resulting in biased mean estimates and a severe underestimation of the true variance in the performance distribution. We therefore contend that a robust CIL evaluation protocol should accurately characterize and estimate the entire performance distribution. To this end, we introduce the concept of extreme sequences and provide theoretical justification for their crucial role in the reliable evaluation of CIL. Moreover, we observe a consistent positive correlation between inter-task similarity and model performance, a relation that can be leveraged to guide the search for extreme sequences. Building on these insights, we propose EDGE (Extreme case-based Distribution and Generalization Evaluation), an evaluation protocol that adaptively identifies and samples extreme class sequences using inter-task similarity, offering a closer approximation of the ground-truth performance distribution. Extensive experiments demonstrate that EDGE effectively captures performance extremes and yields more accurate estimates of distributional boundaries, providing actionable insights for model selection and robustness checking. Our code is available at https://github.com/AIGNLAI/EDGE.

artificial intelligence, machine learning, sequence, (17 more...)

2509.2258

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.66)

arXiv.org Artificial IntelligenceJul-15-2025

On Evaluating Performance of LLM Inference Serving Systems

Agrawal, Amey, Kedia, Nitin, Agarwal, Anmol, Mohan, Jayashree, Kwatra, Nipun, Kundu, Souvik, Ramjee, Ramachandran, Tumanov, Alexey

The rapid evolution of Large Language Model (LLM) inference systems has yielded significant efficiency improvements. However, our systematic analysis reveals that current evaluation methodologies frequently exhibit fundamental flaws, often manifesting as common evaluation anti-patterns that obscure true performance characteristics and impede scientific progress. Through a comprehensive examination of recent systems, we identify recurring anti-patterns across three key dimensions: Baseline Fairness, Evaluation Setup, and Metric Design. These anti-patterns are uniquely problematic for LLM inference due to its dual-phase nature combining distinct prefill and decode operations, its handling of highly heterogeneous workloads, and its strict temporal requirements for interactive use. We demonstrate how common anti-patterns -- such as inadequate baseline comparisons that conflate engineering effort with algorithmic novelty, workload selections that fail to represent production scenarios, and metric normalizations that hide substantial performance variability like generation stalls-lead to misleading conclusions. To address these challenges, we provide a comprehensive checklist derived from our analysis, establishing a framework for recognizing and avoiding these anti-patterns in favor of robust LLM inference evaluation. To demonstrate the practical application of our framework, we present a case study analyzing speculative decoding, a technique whose bursty, non-uniform token generation is easily misinterpreted when evaluated using approaches characteristic of these anti-patterns. Our work establishes a rigorous foundation for evaluation methodology, enabling meaningful comparisons, ensuring reproducible results, and ultimately accelerating genuine progress in LLM inference systems by moving beyond common anti-patterns to align evaluation with real-world requirements.

large language model, latency, natural language, (16 more...)

2507.09019

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing SystemsMay-26-2025, 19:19:16 GMT

Efficient multi-prompt evaluation of LLMs

artificial intelligence, large language model, natural language, (9 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial IntelligenceMar-29-2025

Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion

Nasr-Esfahany, Arash, Alizadeh, Mohammad, Lee, Victor, Alam, Hanna, Coon, Brett W., Culler, David, Dadu, Vidushi, Dixon, Martin, Levy, Henry M., Pandey, Santosh, Ranganathan, Parthasarathy, Yazdanbakhsh, Amir

Cycle-level simulators such as gem5 are widely used in microarchitecture design, but they are prohibitively slow for large-scale design space explorations. We present Concorde, a new methodology for learning fast and accurate performance models of microarchitectures. Unlike existing simulators and learning approaches that emulate each instruction, Concorde predicts the behavior of a program based on compact performance distributions that capture the impact of different microarchitectural components. It derives these performance distributions using simple analytical models that estimate bounds on performance induced by each microarchitectural component, providing a simple yet rich representation of a program's performance characteristics across a large space of microarchitectural parameters. Experiments show that Concorde is more than five orders of magnitude faster than a reference cycle-level simulator, with about 2% average Cycles-Per-Instruction (CPI) prediction error across a range of SPEC, open-source, and proprietary benchmarks. This enables rapid design-space exploration and performance sensitivity analyses that are currently infeasible, e.g., in about an hour, we conducted a first-of-its-kind fine-grained performance attribution to different microarchitectural components across a diverse set of programs, requiring nearly 150 million CPI evaluations.

artificial intelligence, machine learning, modeling & simulation, (15 more...)

2503.23076

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > Middle East > Yemen > Amran Governorate > Amran (0.04)

Genre: Research Report (1.00)

Industry: Energy (0.93)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceFeb-27-2025

Distributional Scaling Laws for Emergent Capabilities

Zhao, Rosie, Qin, Tian, Alvarez-Melis, David, Kakade, Sham, Saphra, Naomi

In this paper, we explore the nature of sudden breakthroughs in language model performance at scale, which stands in contrast to smooth improvements governed by scaling laws. While advocates of "emergence" view abrupt performance gains as capabilities unlocking at specific scales, others have suggested that they are produced by thresholding effects and alleviated by continuous metrics. We propose that breakthroughs are instead driven by continuous changes in the probability distribution of training outcomes, particularly when performance is bimodally distributed across random seeds. In synthetic length generalization tasks, we show that different random seeds can produce either highly linear or emergent scaling trends. We reveal that sharp breakthroughs in metrics are produced by underlying continuous changes in their distribution across seeds. Furthermore, we provide a case study of inverse scaling and show that even as the probability of a successful run declines, the average performance of a successful run continues to increase monotonically. We validate our distributional scaling framework on realistic settings by measuring MMLU performance in LLM populations. These insights emphasize the role of random variation in the effect of scale on LLM capabilities.

distributional scaling law, performance distribution, probability, (15 more...)

2502.17356

Country:

Asia > Middle East > Jordan (0.04)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.56)

arXiv.org Artificial IntelligenceFeb-13-2025

Efficient Evaluation of Multi-Task Robot Policies With Active Experiment Selection

Anwar, Abrar, Gupta, Rohan, Merchant, Zain, Ghosh, Sayan, Neiswanger, Willie, Thomason, Jesse

Evaluating learned robot control policies to determine their physical task-level capabilities costs experimenter time and effort. The growing number of policies and tasks exacerbates this issue. It is impractical to test every policy on every task multiple times; each trial requires a manual environment reset, and each task change involves re-arranging objects or even changing robots. Naively selecting a random subset of tasks and policies to evaluate is a high-cost solution with unreliable, incomplete results. In this work, we formulate robot evaluation as an active testing problem. We propose to model the distribution of robot performance across all tasks and policies as we sequentially execute experiments. Tasks often share similarities that can reveal potential relationships in policy behavior, and we show that natural language is a useful prior in modeling these relationships between tasks. We then leverage this formulation to reduce the experimenter effort by using a cost-aware expected information gain heuristic to efficiently select informative trials. Our framework accommodates both continuous and discrete performance outcomes. We conduct experiments on existing evaluation data from real robots and simulations. By prioritizing informative trials, our framework reduces the cost of calculating evaluation metrics for robot policies across many tasks.

artificial intelligence, evaluation, experiment, (16 more...)

2502.09829

Country:

North America > United States > California (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)

Mirzadeh, Iman, Alizadeh, Keivan, Shahrokhi, Hooman, Tuzel, Oncel, Bengio, Samy, Farajtabar, Mehrdad

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

arXiv.org Artificial IntelligenceOct-7-2024

Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models.Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.

accuracy, gsm-symbolic, reasoning, (16 more...)

2410.05229

Country:

Europe > Austria > Vienna (0.14)
North America > United States > New York (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education > Educational Setting (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)