AITopics | hallucination rate

Collaborating Authors

hallucination rate

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

Liu, Emmy, Gangal, Varun, Yu, Michael, Tao, Zhuofu, Singh, Karan, Kumar, Sachin, Feng, Steven Y.

arXiv.org Machine LearningMay-20-2026

Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across tasks such as summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting actually reduces hallucinations across contexts. Current hallucination benchmarks either require human annotation and fixed references that may eventually be memorized, or rely on naturalistic observations often recorded in settings that are difficult to reproduce or test systematically. To enable further research on the root causes of hallucination, we introduce HALLUWORLD, an extensible benchmark framework grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this reference world. Building on this view, we construct a family of synthetic and semi-synthetic benchmark environments in which the reference world is fully specified, the model's observable view is controlled, and hallucination labels can be generated automatically by construction. HALLUWORLD spans multiple settings that are classically representative for AI, i.e., gridworlds, chess, and realistic terminal tasks. This enables controlled variation of key factors such as world complexity, observability, temporal change, and source-conflict policy, allowing us to disentangle hallucinations into more fine-grained error categories. We evaluate frontier and open-weight language models across these settings and find consistent patterns across domains: perceptual hallucination on directly observed information is near-solved for frontier models, while multi-step state tracking and causal forward simulation are still difficult for frontier models, and are not generally solved by extended thinking.

large language model, machine learning, natural language, (22 more...)

arXiv.org Machine Learning

2605.19341

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games > Chess (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.89)

Add feedback

c9b551a2e195a209fc0b280de2f7f781-Paper-Conference.pdf

Neural Information Processing SystemsFeb-18-2026, 03:41:00 GMT

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > China > Heilongjiang Province > Harbin (0.04)
(6 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Industry: Information Technology > Security & Privacy (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Sensing and Signal Processing > Image Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Estimating the Hallucination Rate of Generative AI Andrew Jesson Nicolas Beltran-V elez * Quentin Chu

Neural Information Processing SystemsFeb-11-2026, 04:01:08 GMT

In ICL, a conditional generative model (CGM) is prompted with a dataset and a prediction question and asked to generate a response.

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States (0.67)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Industry:

Leisure & Entertainment (1.00)
Government > Regional Government > North America Government > United States Government (0.67)
Media > Film (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (1.00)
(2 more...)

Add feedback

Estimating the Hallucination Rate of Generative AI

Neural Information Processing SystemsDec-24-2025, 23:12:50 GMT

This paper presents a method for estimating the hallucination rate for in-context learning (ICL) with generative AI. In ICL, a conditional generative model (CGM) is prompted with a dataset and a prediction question and asked to generate a response. One interpretation of ICL assumes that the CGM computes the posterior predictive of an unknown Bayesian model, which implicitly defines a joint distribution over observable datasets and latent mechanisms. This joint distribution factorizes into two components: the model prior over mechanisms and the model likelihood of datasets given a mechanism. With this perspective, we define a \textit{hallucination} as a generated response to the prediction question with low model likelihood given the mechanism. We develop a new method that takes an ICL problem and estimates the probability that a CGM will generate a hallucination. Our method only requires generating prediction questions and responses from the CGM and evaluating its response log probability. We empirically evaluate our method using large language models for synthetic regression and natural language ICL tasks.

artificial intelligence, machine learning, natural language, (10 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Generation (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.67)

Add feedback

Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning

Wannan, null, Yang, null, Qiu, Xinchi, Yu, Lei, Zhang, Yuchen, Yang, Aobo, Kokhlikyan, Narine, Cancedda, Nicola, Garcia-Olano, Diego

arXiv.org Artificial IntelligenceDec-9-2025

Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that connects interpretability with amortized optimization. CASAL directly bakes the benefits of activation steering into model's weights. Once trained, LLMs answer questions they know while abstaining from answering those they do not. CASAL's light-weight design requires training only a submodule of a single transformer layer and yet reduces hallucination by 30%-40% across multiple short-form QA benchmarks. CASAL is 30x more compute-efficient and 20x more data-efficient than strong LoRA-based baselines such as SFT and DPO, boosting its practical applicability in data scarce domains. Importantly, CASAL also generalizes effectively to out-of-distribution (OOD) domains. We showcase CASAL's flexibility in mitigating hallucinations in both text-only and vision-language models. To our knowledge, CASAL is the first steering-based training method that has been shown to be effective for both dense and Mixture-of-Experts (MoE) models. CASAL represents a promising step forward for applying interpretability-inspired method for practical deployment in production systems.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2510.02324

Country:

Europe (1.00)
Africa > South Africa (0.47)
Asia > Middle East (0.28)

Genre: Research Report (1.00)

Industry:

Media > Music (0.93)
Leisure & Entertainment > Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Geometric Uncertainty for Detecting and Correcting Hallucinations in LLMs

Phillips, Edward, Wu, Sean, Molaei, Soheila, Belgrave, Danielle, Thakur, Anshul, Clifton, David

arXiv.org Artificial IntelligenceDec-3-2025

Large language models demonstrate impressive results across diverse tasks but are still known to hallucinate, generating linguistically plausible but incorrect answers to questions. Uncertainty quantification has been proposed as a strategy for hallucination detection, requiring estimates for both global uncertainty (attributed to a batch of responses) and local uncertainty (attributed to individual responses). While recent black-box approaches have shown some success, they often rely on disjoint heuristics or graph-theoretic approximations that lack a unified geometric interpretation. We introduce a geometric framework to address this, based on archetypal analysis of batches of responses sampled with only black-box model access. At the global level, we propose Geometric V olume, which measures the convex hull volume of archetypes derived from response embeddings. At the local level, we propose Geometric Suspicion, which leverages the spatial relationship between responses and these archetypes to rank reliability, enabling hallucination reduction through preferential response selection. Unlike prior methods that rely on discrete pairwise comparisons, our approach provides continuous semantic boundary points which have utility for attributing reliability to individual responses. Experiments show that our framework performs comparably to or better than prior methods on short form question-answering datasets, and achieves superior results on medical datasets where hallucinations carry particularly critical risks. We also provide theoretical justification by proving a link between convex hull volume and entropy. Large language models (LLMs) have achieved remarkable performance across diverse natural language processing tasks (Guo et al., 2025; Anthropic, 2025; Gemini Team, Google DeepMind, 2025; OpenAI, 2025) and are increasingly applied in areas such as medical diagnosis, law, and financial advice (Y ang et al., 2025; Chen et al., 2024; Kong et al., 2024). Hallucinations, however, where models generate plausible but false or fabricated content, pose significant risks for adoption in high-stakes applications (Farquhar et al., 2024). Recent work, for example, finds GPT -4 hallucinating in 28.6% of reference generation tasks (Chelli et al., 2024).

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2509.13813

Country: Asia (0.46)

Genre: Research Report (1.00)

Industry: Health & Medicine > Diagnostic Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

Add feedback

Beyond Component Strength: Synergistic Integration and Adaptive Calibration in Multi-Agent RAG Systems

Krishnan, Jithin

arXiv.org Artificial IntelligenceDec-1-2025

Building reliable retrieval-augmented generation (RAG) systems requires more than adding powerful components; it requires understanding how they interact. Using ablation studies on 50 queries (15 answerable, 10 edge cases, and 25 adversarial), we show that enhancements such as hybrid retrieval, ensemble verification, and adaptive thresholding provide almost no benefit when used in isolation, yet together achieve a 95% reduction in abstention (from 40% to 2%) without increasing hallucinations. We also identify a measurement challenge: different verification strategies can behave safely but assign inconsistent labels (for example, "abstained" versus "unsupported"), creating apparent hallucination rates that are actually artifacts of labeling. Our results show that synergistic integration matters more than the strength of any single component, that standardized metrics and labels are essential for correctly interpreting performance, and that adaptive calibration is needed to prevent overconfident over-answering even when retrieval quality is high.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.21729

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Add feedback

FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation

Hildebrand, Samuel, Taylor, Curtis, Oesch, Sean, Ghawaly, James M Jr, Sadovnik, Amir, Shivers, Ryan, Schreiber, Brandon, Kurian, Kevin

arXiv.org Artificial IntelligenceNov-21-2025

Abstract--Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark designed to evaluate RAG pipelines as a whole, evaluating a pipeline's ability to ingest, retrieve, and reason about several modalities of information, differentiating it from existing benchmarks that focus on particular aspects such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline's ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines in both correctness and hallucination metrics, with wider performance gaps in questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating "strongly agree"). Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT -Battelle, LLC, for the U. S. Department of Energy. Notice: This manuscript has been authored by UT -Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2510.08945

Country: North America > United States > Louisiana > East Baton Rouge Parish > Baton Rouge (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Cluster-based Adaptive Retrieval: Dynamic Context Selection for RAG Applications

Xu, Yifan, Gupta, Vipul, Aggarwal, Rohit, Mahadevan, Varsha, Krishnamachari, Bhaskar

arXiv.org Artificial IntelligenceNov-20-2025

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by pulling in external material, document, code, manuals, from vast and ever-growing corpora, to effectively answer user queries. The effectiveness of RAG depends significantly on aligning the number of retrieved documents with query characteristics: narrowly focused queries typically require fewer, highly relevant documents, whereas broader or ambiguous queries benefit from retrieving more extensive supporting information. However, the common static top-k retrieval approach fails to adapt to this variability, resulting in either insufficient context from too few documents or redundant information from too many. Motivated by these challenges, we introduce Cluster-based Adaptive Retrieval (CAR), an algorithm that dynamically determines the optimal number of documents by analyzing the clustering patterns of ordered query-document similarity distances. CAR detects the transition point within similarity distances, where tightly clustered, highly relevant documents shift toward less pertinent candidates, establishing an adaptive cut-off that scales with query complexity. On Coinbase's CDP corpus and the public MultiHop-RAG benchmark, CAR consistently picks the optimal retrieval depth and achieves the highest TES score, outperforming every fixed top-k baseline. In downstream RAG evaluations, CAR cuts LLM token usage by 60%, trims end-to-end latency by 22%, and reduces hallucinations by 10% while fully preserving answer relevance. Since integrating CAR into Coinbase's virtual assistant, we've seen user engagement jump by 200%.

large language model, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2511.14769

Country: North America > United States > California > Los Angeles County > Los Angeles (0.28)

Genre: Research Report (0.82)

Industry: Banking & Finance > Trading (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models

Jackson, Declan, Keating, William, Cameron, George, Hill-Smith, Micah

arXiv.org Artificial IntelligenceNov-18-2025

We introduce AA-Omniscience, a benchmark designed to measure both factual recall and knowledge calibration across 6,000 questions. Questions are derived from authoritative academic and industry sources, and cover 42 economically relevant topics within six different domains. The evaluation measures a model's Omniscience Index, a bounded metric (-100 to 100) measuring factual recall that jointly penalizes hallucinations and rewards abstention when uncertain, with 0 equating to a model that answers questions correctly as much as it does incorrectly. Among evaluated models, Claude 4.1 Opus attains the highest score (4.8), making it one of only three models to score above zero. These results reveal persistent factuality and calibration weaknesses across frontier models. Performance also varies by domain, with the models from three different research labs leading across the six domains. This performance variability suggests models should be chosen according to the demands of the use case rather than general performance for tasks where knowledge is important.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.13029

Country:

North America > United States > California (0.28)
North America > United States > Arkansas (0.28)

Genre: Research Report (0.52)

Industry:

Law (0.93)
Health & Medicine (0.93)
Government > Regional Government > North America Government > United States Government (0.46)
Materials > Metals & Mining (0.31)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback