 conflicting


ClaimIQ at CheckThat! 2025: Comparing Prompted and Fine-Tuned Language Models for Verifying Numerical Claims

arXiv.org Artificial Intelligence

This paper presents our system for Task 3 of the CLEF 2025 CheckThat! Lab, which focuses on verifying numerical and temporal claims using retrieved evidence. We explore two complementary approaches: zero-shot prompting with instruction-tuned large language models (LLMs) and supervised fine-tuning using parameter-efficient LoRA. To enhance evidence quality, we investigate several selection strategies, including full-document input and top-k sentence filtering using BM25 and MiniLM. Our best-performing model, LLaMA fine-tuned with LoRA, achieves strong performance on the English validation set. However, a notable drop on the test set highlights a generalization challenge. These findings underscore the importance of evidence granularity and model adaptation for robust numerical fact verification.
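The top-k sentence filtering mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the tokenizer, the function name `bm25_top_k`, the parameter defaults `k1=1.5` and `b=0.75`, and the example sentences are all assumptions chosen for illustration, and the scoring is a plain hand-written BM25 (Okapi form with a smoothed IDF) rather than any particular library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(claim, sentences, k=2, k1=1.5, b=0.75):
    """Return the k sentences with the highest BM25 score for the claim."""
    if not sentences:
        return []
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of sentences containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(claim))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)

    # Stable sort: ties keep their original document order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:k]]


evidence = [
    "Unemployment fell to 4.2 percent in March.",
    "The weather was sunny.",
    "The mayor opened a new park.",
]
selected = bm25_top_k("Unemployment fell to 4 percent", evidence, k=1)
```

Only the first evidence sentence shares terms with the claim, so it is the one selected. A dense retriever such as MiniLM would replace the lexical score with embedding similarity while keeping the same top-k step.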


DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification

arXiv.org Artificial Intelligence

Numerical claims, that is, statements involving quantities, comparisons, and temporal references, pose unique challenges for automated fact-checking systems. In this study, we evaluate modeling strategies for veracity prediction of such claims using the QuanTemp dataset and our own evidence retrieval pipeline. We investigate three key factors: (1) the impact of more evidence with longer input context windows using ModernBERT, (2) the effect of right-to-left (R2L) tokenization, and (3) their combined influence on classification performance. Contrary to prior findings in arithmetic reasoning tasks, R2L tokenization does not boost natural language inference (NLI) on numerical tasks. A longer context window does not enhance veracity prediction performance either, highlighting evidence quality as the dominant bottleneck. Our best-performing system achieves a competitive macro-average F1 score of 0.57 and places us among the Top-4 submissions in Task 3 of CheckThat! 2025. Our code is available at https://github.com/dsgt-arc/checkthat-2025-numerical.
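The abstract does not spell out how R2L tokenization is applied. As a rough illustration only, the sketch below assumes it means splitting a digit string into fixed-size groups starting from the right (so groups align with thousands), as opposed to the usual left-to-right grouping. The function name `chunk_digits` and the group size of three are hypothetical and not taken from the paper or its repository.

```python
def chunk_digits(number, size=3, right_to_left=True):
    """Split a digit string into groups of `size` digits.

    right_to_left=True aligns groups to the least significant digit;
    right_to_left=False groups greedily from the most significant digit.
    """
    digits = str(number)
    if not digits.isdigit():
        raise ValueError("expected a non-negative integer written in digits")
    if not right_to_left:
        return [digits[i:i + size] for i in range(0, len(digits), size)]
    # The leftover digits (if any) form a shorter leading group.
    head = len(digits) % size
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks


r2l = chunk_digits("1234567")                       # ['1', '234', '567']
l2r = chunk_digits("1234567", right_to_left=False)  # ['123', '456', '7']
```

Under this assumption, the right-to-left grouping keeps units, thousands, and millions in separate, consistently aligned tokens, which is the property prior arithmetic work credits for its gains.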


Exploring and Evaluating Hallucinations in LLM-Powered Code Generation

arXiv.org Artificial Intelligence

The rise of Large Language Models (LLMs) has significantly advanced many applications in software engineering, particularly code generation. Despite this promising performance, LLMs are prone to hallucination: they may produce outputs that deviate from users' intent, exhibit internal inconsistencies, or conflict with factual knowledge, making their deployment potentially risky in a wide range of applications. Existing work mainly focuses on investigating hallucination in natural language generation (NLG), leaving a gap in understanding the types and extent of hallucinations in code generation. To bridge this gap, we conducted a thematic analysis of LLM-generated code to summarize and categorize the hallucinations present in it. Our study established a comprehensive taxonomy of hallucinations in LLM-generated code, encompassing 5 primary categories of hallucinations depending on the conflicting objectives and varying degrees of deviation observed in code generation. Furthermore, we systematically analyzed the distribution of hallucinations, exploring variations among different LLMs and their correlation with code correctness. Based on the results, we proposed HalluCode, a benchmark for evaluating the performance of code LLMs in recognizing hallucinations. Hallucination recognition and mitigation experiments with HalluCode and HumanEval show that existing LLMs face great challenges in recognizing hallucinations, particularly in identifying their types, and are hardly able to mitigate them. We believe our findings will shed light on future research on hallucination evaluation, detection, and mitigation, ultimately paving the way for more effective and reliable code LLMs.


On Dealing with Conflicting, Uncertain and Partially Ordered Ontologies

AAAI Conferences

We focus on handling conflicting and uncertain information in lightweight ontologies, where uncertainty is represented in a possibilistic logic setting. We use DL-Lite, a tractable fragment of Description Logic, to specify terminological knowledge (i.e., TBox). We assume the TBox to be stable and coherent, while its combination with a set of assertional facts (i.e., ABox) may be inconsistent. We address the problem of dealing with conflicts when the reliability relation between sources is only partially ordered. We propose to represent the uncertain ABox as a symbolic weighted base, where a strict partial preorder is applied on the weights. In this context, we provide a strategy for computing a single repair for the ABox, called the partial possibilistic repair. The idea is to consider all compatible bases of a partially preordered ABox (which intuitively encode total extensions of the partial preorder), compute their associated possibilistic repairs, and then intersect those repairs. We define the notion of π-accepted assertions and provide an equivalent characterization, thereby ensuring that our method can be computed tractably.


The Two (Conflicting) Definitions of AI

#artificialintelligence

Summary: There are two definitions currently in use for AI, the popular definition and the data science definition, and they conflict in fundamental ways. If you're going to explain or recommend AI to a non-data scientist, it's important to understand the difference. For a profession as concerned with accuracy as we are, we do a really poor job of naming things, or at least of being consistent in the naming. "Big Data" – totally misleading (since it incorporates velocity and variety in addition to volume). How many times have you had to correct someone on that?