AITopics | inconsistency detection

Collaborating Authors

inconsistency detection

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

LLM-Assisted Formalization Enables Deterministic Detection of Statutory Inconsistency in the Internal Revenue Code

Yadamsuren, Borchuluun, Platt, Steven Keith, Diaz, Miguel

arXiv.org Artificial IntelligenceNov-18-2025

This study introduces a hybrid neuro-symbolic framework that achieves deterministic detection of statutory inconsistency in complex law. We use the U.S. Internal Revenue Code (IRC) as a case study because its complexity makes it a fertile domain for identifying conflicts. Our research offers a solution for detecting inconsistent provisions by combining Large Language Models (LLMs) with symbolic logic. LLM-based methods can support compliance, fairness, and statutory drafting, yet tax-specific applications remain sparse. A key challenge is that such models struggle with hierarchical processing and deep structured reasoning, especially over long text. This research addresses these gaps through experiments using GPT-4o, GPT-5, and Prolog. GPT-4o was first used to translate Section 121 into Prolog rules and refine them in SWISH. These rules were then incorporated into prompts to test whether Prolog-augmented prompting improved GPT-4o's inconsistency detection. GPT-4o, whether prompted with natural language alone or with Prolog augmentation, detected the inconsistency in only one of three strategies (33 percent accuracy), but its reasoning quality differed: natural-language prompting achieved 100 percent rule coverage, while Prolog-augmented prompting achieved 66 percent, indicating more incomplete statutory analysis. In contrast to probabilistic prompting, the hybrid Prolog model produced deterministic and reproducible results. Guided by GPT-5 for refinement, the model formalized the IRC section's competing interpretations and successfully detected an inconsistency zone. Validation tests confirm that the Prolog implementation is accurate, internally consistent, deterministic, and capable of autonomously identifying inconsistencies. These findings show that LLM-assisted formalization, anchored in symbolic logic, enables transparent and reliable statutory inconsistency detection.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2511.11954

Country: North America > United States (1.00)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Government > Tax (0.95)
Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models

Semnani, Sina J., Burapacheep, Jirayu, Khatua, Arpandeep, Atchariyachanvanit, Thanawan, Wang, Zheng, Lam, Monica S.

arXiv.org Artificial IntelligenceSep-30-2025

Wikipedia is the largest open knowledge corpus, widely used worldwide and serving as a key resource for training large language models (LLMs) and retrieval-augmented generation (RAG) systems. Ensuring its accuracy is therefore critical. But how accurate is Wikipedia, and how can we improve it? We focus on inconsistencies, a specific type of factual inaccuracy, and introduce the task of corpus-level inconsistency detection. We present CLAIRE, an agentic system that combines LLM reasoning with retrieval to surface potentially inconsistent claims along with contextual evidence for human review. In a user study with experienced Wikipedia editors, 87.5% reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies in the same amount of time. Combining CLAIRE with human annotation, we contribute WIKICOLLIDE, the first benchmark of real Wikipedia inconsistencies. Using random sampling with CLAIRE-assisted analysis, we find that at least 3.3% of English Wikipedia facts contradict another fact, with inconsistencies propagating into 7.3% of FEVEROUS and 4.0% of AmbigQA examples. Benchmarking strong baselines on this dataset reveals substantial headroom: the best fully automated system achieves an AUROC of only 75.1%. Our results show that contradictions are a measurable component of Wikipedia and that LLM-based systems like CLAIRE can provide a practical tool to help editors improve knowledge consistency at scale.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.23233

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment (0.67)
Government (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Misleading through Inconsistency: A Benchmark for Political Inconsistencies Detection

Sagimbayeva, Nursulu, Bahçeci, Ruveyda Betül, Weber, Ingmar

arXiv.org Artificial IntelligenceMay-27-2025

Inconsistent political statements represent a form of misinformation. They erode public trust and pose challenges to accountability, when left unnoticed. Detecting inconsistencies automatically could support journalists in asking clarification questions, thereby helping to keep politicians accountable. We propose the Inconsistency detection task and develop a scale of inconsistency types to prompt NLP-research in this direction. To provide a resource for detecting inconsistencies in a political domain, we present a dataset of 698 human-annotated pairs of political statements with explanations of the annotators' reasoning for 237 samples. The statements mainly come from voting assistant platforms such as Wahl-O-Mat in Germany and Smartvote in Switzerland, reflecting real-world political issues. We benchmark Large Language Models (LLMs) on our dataset and show that in general, they are as good as humans at detecting inconsistencies, and might be even better than individual humans at predicting the crowd-annotated ground-truth. However, when it comes to identifying fine-grained inconsistency types, none of the model have reached the upper bound of performance (due to natural labeling variation), thus leaving room for improvement. We make our dataset and code publicly available.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2505.19191

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

German also Hallucinates! Inconsistency Detection in News Summaries with the Absinth Dataset

Mascarell, Laura, Chalumattu, Ribin, Rios, Annette

arXiv.org Artificial IntelligenceMar-14-2024

The advent of Large Language Models (LLMs) has led to remarkable progress on a wide range of natural language processing tasks. Despite the advances, these large-sized models still suffer from hallucinating information in their output, which poses a major issue in automatic text summarization, as we must guarantee that the generated summary is consistent with the content of the source document. Previous research addresses the challenging task of detecting hallucinations in the output (i.e. inconsistency detection) in order to evaluate the faithfulness of the generated summaries. However, these works primarily focus on English and recent multilingual approaches lack German data. This work presents absinth, a manually annotated dataset for hallucination detection in German news summarization and explores the capabilities of novel open-source LLMs on this task in both fine-tuning and in-context learning settings. We open-source and release the absinth dataset to foster further research on hallucination detection in German.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.3929/ethz-b-000661775

2403.0375

Country:

Europe > Switzerland > Zürich > Zürich (0.05)
Asia > Singapore (0.05)
North America > Canada > Ontario > Toronto (0.04)
(7 more...)

Genre:

Research Report (0.82)
Personal > Honors (0.69)

Industry: Media > News (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.77)

Add feedback

SIFiD: Reassess Summary Factual Inconsistency Detection with LLM

Yang, Jiuding, Liu, Hui, Guo, Weidong, Rao, Zhuwei, Xu, Yu, Niu, Di

arXiv.org Artificial IntelligenceMar-12-2024

Ensuring factual consistency between the summary and the original document is paramount in summarization tasks. Consequently, considerable effort has been dedicated to detecting inconsistencies. With the advent of Large Language Models (LLMs), recent studies have begun to leverage their advanced language understanding capabilities for inconsistency detection. However, early attempts have shown that LLMs underperform traditional models due to their limited ability to follow instructions and the absence of an effective detection methodology. In this study, we reassess summary inconsistency detection with LLMs, comparing the performances of GPT-3.5 and GPT-4. To advance research in LLM-based inconsistency detection, we propose SIFiD (Summary Inconsistency Detection with Filtered Document) that identify key sentences within documents by either employing natural language inference or measuring semantic similarity between summaries and documents.

detection, gpt-3, inconsistency detection, (12 more...)

arXiv.org Artificial Intelligence

2403.07557

Country: North America > Canada > Alberta (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Prospects for inconsistency detection using large language models and sheaves

Huntsman, Steve, Robinson, Michael, Huntsman, Ludmilla

arXiv.org Artificial IntelligenceJan-29-2024

We demonstrate that large language models can produce reasonable numerical ratings of the logical consistency of claims. We also outline a mathematical approach based on sheaf theory for lifting such ratings to hypertexts such as laws, jurisprudence, and social media and evaluating their consistency globally. This approach is a promising avenue to increasing consistency in and of government, as well as to combating mis- and disinformation and related ills.

arxiv, consistency, language model, (14 more...)

arXiv.org Artificial Intelligence

2401.16713

Country:

North America > United States > Pennsylvania (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)

Genre: Research Report (0.64)

Industry:

Law (0.94)
Government > Regional Government > North America Government > United States Government (0.93)
Government > Military (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Add feedback

Zero-shot Faithfulness Evaluation for Text Summarization with Foundation Language Model

Jia, Qi, Ren, Siyu, Liu, Yizhu, Zhu, Kenny Q.

arXiv.org Artificial IntelligenceDec-14-2023

Despite tremendous improvements in natural language generation, summarization models still suffer from the unfaithfulness issue. Previous work evaluates faithfulness either using models trained on the other tasks or in-domain synthetic data, or prompting a large model such as ChatGPT. This paper proposes to do zero-shot faithfulness evaluation simply with a moderately-sized foundation language model. We introduce a new metric FFLM, which is a combination of probability changes based on the intuition that prefixing a piece of text that is consistent with the output will increase the probability of predicting the output. Experiments show that FFLM performs competitively with or even outperforms ChatGPT on both inconsistency detection and faithfulness rating with 24x fewer parameters. FFLM also achieves improvements over other strong baselines.

dataset, evaluation, proceedings, (14 more...)

arXiv.org Artificial Intelligence

2310.11648

Country:

North America > United States > Texas (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback

Fast and Accurate Factual Inconsistency Detection Over Long Documents

Lattimer, Barrett Martin, Chen, Patrick, Zhang, Xinyuan, Yang, Yi

arXiv.org Artificial IntelligenceOct-22-2023

SCALE consists of two crucial components. Large Language Models (LLMs) have shown immense First, it builds on a Natural language inference promise in various applications, but deploying (NLI) based method, integrating a novel them in real-time presents certain challenges chunking mechanism for rapid and accurate online such as hallucinations (Cao et al., 2018; Falke et al., performance in diverse natural language generation 2019; Kryściński et al., 2019; Fabbri et al., 2021a; (NLG) tasks. Second, model explanation is Honovich et al., 2022). Hallucinations, or factual essential for real-time deployment of inconsistency inconsistencies generated by a model relative to detection systems, facilitating swift human inspection a source document, can mislead the user and undermine to determine model configurations. We show trust in LLMs. Thus, detecting factual that our chunking mechanism improves calibration inconsistency in LLM generations is crucial for scores and enables the use of a binary search tree the future of LLMs, especially with the growing algorithm for rapidly locating relevant source text popularity of platforms like ChatGPT.

computational linguistic, dataset, screeneval, (12 more...)

arXiv.org Artificial Intelligence

2310.13189

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Dominican Republic (0.04)
North America > Canada (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Interpretable Automatic Fine-grained Inconsistency Detection in Text Summarization

Chan, Hou Pong, Zeng, Qi, Ji, Heng

arXiv.org Artificial IntelligenceMay-23-2023

Existing factual consistency evaluation approaches for text summarization provide binary predictions and limited insights into the weakness of summarization systems. Therefore, we propose the task of fine-grained inconsistency detection, the goal of which is to predict the fine-grained types of factual errors in a summary. Motivated by how humans inspect factual inconsistency in summaries, we propose an interpretable fine-grained inconsistency detection model, FineGrainFact, which explicitly represents the facts in the documents and summaries with semantic frames extracted by semantic role labeling, and highlights the related semantic frames to predict inconsistency. The highlighted semantic frames help verify predicted error types and correct inconsistent summaries. Experiment results demonstrate that our model outperforms strong baselines and provides evidence to support or refute the summary.

artificial intelligence, natural language, text processing, (16 more...)

arXiv.org Artificial Intelligence

2305.14548

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > United Kingdom > Northern Ireland (0.05)
Asia > Macao (0.04)
(12 more...)

Genre: Research Report > New Finding (0.34)

Industry: Government > Regional Government > North America Government > United States Government (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)

Add feedback

Code Comment Inconsistency Detection with BERT and Longformer

Steiner, Theo, Zhang, Rui

arXiv.org Artificial IntelligenceJul-28-2022

Comments, or natural language descriptions of source code, are standard practice among software developers. By communicating important aspects of the code such as functionality and usage, comments help with software project maintenance. However, when the code is modified without an accompanying correction to the comment, an inconsistency between the comment and code can arise, which opens up the possibility for developer confusion and bugs. In this paper, we propose two models based on BERT (Devlin et al., 2019) and Longformer (Beltagy et al., 2020) to detect such inconsistencies in a natural language inference (NLI) context. Through an evaluation on a previously established corpus of comment-method pairs both during and after code changes, we demonstrate that our models outperform multiple baselines and yield comparable results to the state-of-the-art models that exclude linguistic and lexical features. We further discuss ideas for future research in using pretrained language models for both inconsistency detection and automatic comment updating.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2207.14444

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > New York > New York County > New York City (0.05)
North America > Puerto Rico (0.04)
(2 more...)

Genre: Research Report (0.70)

Industry: Information Technology (0.48)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback