LLM Performance
Efficient Evaluation of LLM Performance with Statistical Guarantees
Wu, Skyler, Nair, Yash, Candès, Emmanuel J.
Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight confidence intervals (CIs) for model accuracy with valid frequentist coverage. We propose Factorized Active Querying (FAQ), which (a) leverages historical information through a Bayesian factor model; (b) adaptively selects questions using a hybrid variance-reduction/active-learning sampling policy; and (c) maintains validity through Proactive Active Inference -- a finite-population extension of active inference (Zrnic & Candès, 2024) that enables direct question selection while preserving coverage. With negligible overhead cost, FAQ delivers up to $5\times$ effective sample size gains over strong baselines on two benchmark suites, across varying historical-data missingness levels: this means that it matches the CI width of uniform sampling while using up to $5\times$ fewer queries. We release our source code and our curated datasets to support reproducible evaluation and future research.
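One way to unpack the effective-sample-size (ESS) claim at the end of the abstract: for a mean-type estimate, CI width scales like $n^{-1/2}$, so matching widths at different budgets quantifies the gain. A minimal sketch of that standard relationship (our notation, not the paper's):

```latex
% CI width for a population mean shrinks like n^{-1/2}: w(n) \propto n^{-1/2}.
% If FAQ at budget n matches the width that uniform sampling needs budget n_u to reach,
\[
  \text{ESS gain} \;=\; \frac{n_u}{n}
  \;=\; \left( \frac{w_{\mathrm{unif}}(n)}{w_{\mathrm{FAQ}}(n)} \right)^{2},
\]
% so a 5x gain means FAQ's interval at n queries is as tight as
% uniform sampling's at 5n queries.
```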
Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting
Pyeon, Goun, Heo, Inbum, Jung, Jeesu, Hwang, Taewook, Namgoong, Hyuk, Seo, Hyein, Han, Yerim, Kim, Eunbin, Kang, Hyeonseok, Jung, Sangkeun
This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section, ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across varying input modalities (Text-only, Image-only, Text+Figure) and prompt languages (Korean, English). The GPT-5 family models achieved perfect scores (100 points) under a limited set of language-modality configurations, while Grok 4, Qwen 3 235B, and Gemini 2.5 Pro also scored above 97 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, demonstrating high cost-effectiveness. Problem-specific analysis revealed Calculus as the weakest domain, with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while prompt-language effects varied by model scale. In reasoning-enhancement experiments with the GPT-5 series, increased reasoning intensity improved performance (from 82.6 to 100 points) but quadrupled token usage and drastically reduced efficiency, suggesting that models with minimal reasoning may be more practical. This research contributes: (1) implementation of a completely unexposed evaluation environment, (2) a standardized digitization pipeline that converts human-targeted exam materials into LLM-ready evaluation data, and (3) a practical evaluation perspective integrating performance, cost, and time considerations. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard: https://isoft.cnu.ac.kr/csat2026/
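The modality-and-language grid the study sweeps is easy to picture in code. A minimal sketch, assuming a hypothetical `ask_model` callable and illustrative question records (this is not the authors' pipeline):

```python
"""Sketch of a modality x language evaluation grid for a point-weighted exam."""
from itertools import product

MODALITIES = ["text", "image", "text+figure"]
LANGUAGES = ["ko", "en"]

def score_grid(ask_model, questions):
    """ask_model(question, modality, lang) -> answer string (hypothetical stub).
    questions: dicts with an 'answer' key and a CSAT-style 'points' value per item."""
    total = sum(q["points"] for q in questions)
    results = {}
    for modality, lang in product(MODALITIES, LANGUAGES):
        earned = sum(
            q["points"]
            for q in questions
            if ask_model(q, modality, lang) == q["answer"]
        )
        results[(modality, lang)] = 100 * earned / total  # score out of 100
    return results
```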
Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation
Rahman, Musfiqur, Khatoonabadi, SayedHassan, Shihab, Emad
Large language models (LLMs) have demonstrated strong performance on function-level code generation benchmarks, yet real-world software development increasingly demands class-level implementations that integrate multiple methods, attributes, and dependencies within authentic project contexts. This gap between benchmark performance and practical utility raises critical questions about LLMs' readiness for production code assistance, particularly regarding their ability to generalize across familiar and novel codebases. We introduce a benchmark derived from real-world open-source repositories, comprising classes divided into seen and unseen partitions to evaluate generalization under practical conditions. We systematically examine how input specification completeness and retrieval-augmented generation affect class-level correctness across multiple state-of-the-art LLMs. Our evaluation reveals a substantial performance gap: while LLMs achieve 84 to 89% correctness on synthetic benchmarks, they attain only 25 to 34% on real-world class tasks, with minimal distinction between familiar and novel codebases. Comprehensive documentation provides marginal improvements (1 to 3%), whereas retrieval augmentation yields greater gains (4 to 7%) by supplying concrete implementation patterns. Error analysis identifies AttributeError, TypeError, and AssertionError as dominant failure modes, with distinct patterns between synthetic and real-world scenarios. These findings provide actionable insights for enhancing context modelling, documentation strategies, and retrieval integration in production code assistance tools.
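The error analysis here boils down to bucketing test failures by exception type. A minimal sketch under stated assumptions: `run_tests` is a hypothetical harness that raises on failure, not the authors' tooling:

```python
"""Tally dominant failure modes (AttributeError, TypeError, ...) across runs."""
from collections import Counter

def tally_failures(generated_sources, run_tests):
    """run_tests(src) raises on any test failure; returns None on success."""
    modes = Counter()
    for src in generated_sources:
        try:
            run_tests(src)
        except BaseException as exc:  # catch everything: each type is one mode
            modes[type(exc).__name__] += 1
    return modes  # e.g. Counter({'AttributeError': ..., 'TypeError': ...})
```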
Harnessing the Power of Large Language Models for Software Testing Education: A Focus on ISTQB Syllabus
Ngo, Tuan-Phong, Duong, Bao-Ngoc, Hoang, Tuan-Anh, Dwight, Joshua, Khwakhali, Ushik Shrestha
Software testing is a critical component of the software engineering field and is important for software engineering education. Thus, it is vital for academia to continuously improve and update educational methods to reflect the current state of the field. The International Software Testing Qualifications Board (ISTQB) certification framework is globally recognized and widely adopted in industry and academia. However, ISTQB-based learning has rarely been combined with recent advances in generative artificial intelligence, and despite the growing capabilities of large language models (LLMs), ISTQB-based learning and instruction with LLMs have not been thoroughly explored. This paper explores and evaluates how LLMs can complement the ISTQB framework in higher education. The work makes four key contributions: (i) the creation of a comprehensive ISTQB-aligned dataset spanning over a decade, consisting of 28 sample exams and 1,145 questions; (ii) the development of a domain-optimized prompt that enhances LLM precision and explanation quality on ISTQB tasks; (iii) a systematic evaluation of state-of-the-art LLMs on this dataset; and (iv) actionable insights and recommendations for integrating LLMs into software testing education. These findings highlight the promise of LLMs in supporting ISTQB certification preparation and offer a foundation for their broader use in software engineering education.
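A domain-optimized prompt of the kind described in contribution (ii) might look like the following. This template is hypothetical, written to illustrate the idea; the authors' actual prompt is not reproduced here:

```python
"""Illustrative ISTQB-style multiple-choice prompt builder (hypothetical)."""

TEMPLATE = """You are an ISTQB-certified software testing instructor.
Answer the following Foundation Level exam question.

Question: {question}
Options:
{options}

Briefly explain the relevant ISTQB syllabus concept, then give the single
best option on its own line as 'Answer: <letter>'."""

def build_prompt(question, options):
    # Label options A, B, C, ... before filling in the template.
    lettered = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return TEMPLATE.format(question=question, options=lettered)
```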
Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability
Saju, Lorraine, Bleier, Arnim, Lasser, Jana, Wagner, Claudia
The proliferation of misinformation necessitates scalable, automated fact-checking solutions. Yet current benchmarks often overlook multilingual and topical diversity. This paper introduces a novel, dynamically extensible dataset of 61,514 claims spanning multiple languages and topics, extending existing datasets up to 2024. Through a comprehensive evaluation of five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B, we identify significant performance gaps between languages and topics. While GPT-4o achieves the highest overall accuracy, it declines to classify 43% of claims. Across all models, factual-sounding claims are misclassified more often than opinions, revealing a key vulnerability. These findings underscore the need for caution and highlight the challenges of deploying LLM-based fact-checking systems at scale. Correspondence: lorraine.saju@gesis.org
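When a model can decline to classify (as GPT-4o does on 43% of claims here), accuracy only tells half the story and should be reported together with coverage. A minimal sketch, with `None` standing in for a declined claim (field conventions are ours, not the paper's):

```python
"""Coverage-aware scoring for classifiers that may abstain."""

def coverage_and_accuracy(predictions, gold):
    """predictions: labels, or None where the model declined to classify."""
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(predictions)   # share of claims answered
    accuracy = (
        sum(p == g for p, g in answered) / len(answered) if answered else 0.0
    )  # accuracy on answered claims only
    return coverage, accuracy
```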
Comparing Human and Language Models Sentence Processing Difficulties on Complex Structures
Amouyal, Samuel Joseph, Meltzer-Asscher, Aya, Berant, Jonathan
Large language models (LLMs) that fluently converse with humans are a reality, but do LLMs experience human-like processing difficulties? We systematically compare human and LLM sentence comprehension across seven challenging linguistic structures. We collect sentence comprehension data from humans and five families of state-of-the-art LLMs, varying in size and training procedure, in a unified experimental framework. Our results show that LLMs struggle on the target structures overall, but especially on garden path (GP) sentences. Indeed, while the strongest models achieve near-perfect accuracy on non-GP structures (93.7% for GPT-5), they struggle on GP structures (46.8% for GPT-5). Additionally, when ranking structures by average performance, the rank correlation between humans and models increases with parameter count. For each target structure, we also collect data on a matched baseline without the difficult structure. Comparing performance on target vs. baseline sentences, the performance gap observed in humans holds for LLMs, with two exceptions: for models that are too weak, performance is uniformly low across both sentence types, and for models that are too strong, it is uniformly high. Together, these results reveal convergence and divergence in human and LLM sentence comprehension, offering new insights into the similarity of humans and LLMs.
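The human-model comparison of difficulty orderings is a standard rank-correlation computation. A minimal sketch, assuming per-structure mean accuracies are already available (the dictionaries are illustrative placeholders, not the paper's data):

```python
"""Spearman rank correlation between human and model difficulty orderings."""
from scipy.stats import spearmanr

def structure_rank_correlation(human_acc, model_acc):
    """Both args map structure name -> mean accuracy on that structure."""
    structures = sorted(human_acc)       # one shared ordering of the keys
    h = [human_acc[s] for s in structures]
    m = [model_acc[s] for s in structures]
    rho, _pvalue = spearmanr(h, m)       # rank-based, so scale-free
    return rho
```

Plotting `rho` against each model's parameter count would trace the trend the abstract describes.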
Performance of Large Language Models in Answering Critical Care Medicine Questions
Alwakeel, Mahmoud, Nagori, Aditya, Wong, An-Kwok Ian, Chaisson, Neal, Krishnamoorthy, Vijay, Kamaleswaran, Rishikesan
Abstract: Large Language Models have been tested on medical student-level questions, but their performance in specialized fields like Critical Care Medicine (CCM) is less explored. This study evaluated Meta-Llama 3.1 models (8B and 70B parameters) on 871 CCM questions. Performance varied across domains, highest in Research (68.4%) and lowest in Renal (47.9%), highlighting the need for broader future work to improve models across various subspecialty domains.
Introduction: The use of Large Language Models (LLMs) to answer medical exam-style questions has gained popularity in recent years. This study aims to evaluate the performance of LLMs in answering subspecialty CCM board exam-style questions.
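Because 871 questions split across many subspecialty domains leave modest per-domain samples, domain accuracies like those above are best read with interval estimates attached. A minimal sketch using the Wilson score interval (record fields are illustrative, not the study's data format):

```python
"""Per-domain accuracy with Wilson 95% intervals for small samples."""
from collections import defaultdict
from math import sqrt

def wilson_interval(k, n, z=1.96):
    """95% score interval for k successes out of n trials."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def per_domain_accuracy(records):
    """records: dicts with a 'domain' key and a boolean 'correct' key."""
    tallies = defaultdict(lambda: [0, 0])  # domain -> [correct, total]
    for r in records:
        tallies[r["domain"]][0] += r["correct"]
        tallies[r["domain"]][1] += 1
    return {d: (k / n, wilson_interval(k, n)) for d, (k, n) in tallies.items()}
```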