 contradict



Axiomatisation for an asynchronous epistemic logic with sending and receiving messages

Balbiani, Philippe, van Ditmarsch, Hans, Lerouvillois, Clara

arXiv.org Artificial Intelligence

We investigate a public announcement logic for asynchronous public announcements wherein the sending of the announcements by the environment is separated from the reception of the announcements by the individual agents. Both come with different modalities. In the logical semantics, formulas are interpreted in a world of a Kripke model but given a history of prior announcements and receptions of announcements that already happened. An axiomatisation AA for such a logic has been given in prior work, for the formulas that are valid when interpreted in the Kripke model before any such announcements have taken place. This axiomatisation is a reduction system wherein one can show that every formula is equivalent to a purely epistemic formula without dynamic modalities for announcements and receptions. We propose a generalisation AA* of this axiomatisation, for the formulas that are valid when interpreted in the Kripke model given any history of prior announcements and receptions of announcements. It does not extend the axiomatisation AA, for example it is no longer valid that nobody has received any announcement. Unlike AA, this axiomatisation AA* is infinitary and it is not a reduction system.



Existing LLMs Are Not Self-Consistent For Simple Tasks

Lin, Zhenru, Tao, Jiawen, Yuan, Yang, Yao, Andrew Chi-Chih

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have grown increasingly powerful, yet ensuring their decisions remain transparent and trustworthy requires self-consistency -- no contradictions in their internal reasoning. Our study reveals that even on simple tasks, such as comparing points on a line or a plane, or reasoning in a family tree, all smaller models are highly inconsistent, and even state-of-the-art models like DeepSeek-R1 and GPT-o4-mini are not fully self-consistent. To quantify and mitigate these inconsistencies, we introduce inconsistency metrics and propose two automated methods -- a graph-based and an energy-based approach. While these fixes provide partial improvements, they also highlight the complexity and importance of self-consistency in building more reliable and interpretable AI. The code and data are available at https://github.com/scorpio-nova/llm-self-consistency.
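To make concrete what a contradiction looks like in the "points on a line" setting, here is a minimal sketch. It is not the paper's inconsistency metric or its graph-based repair method, whose details the abstract does not give. It assumes a model's answers arrive as hypothetical pairs `(a, b)` meaning "a is to the left of b", and it reports a directed cycle when one exists, since any cycle means the answers cannot all hold for a single ordering.

```python
from collections import defaultdict

def find_cycle(less_than_pairs):
    """Return a list of nodes forming a directed cycle, or None if no cycle exists.

    Each pair (a, b) is an assumed model answer "a < b". A cycle such as
    A < B < C < A is a self-contradiction: no ordering satisfies all answers.
    """
    graph = defaultdict(list)
    for a, b in less_than_pairs:
        graph[a].append(b)

    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = defaultdict(int)       # every node starts as WHITE (0)
    stack = []                     # nodes on the current depth-first path

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                # Back edge: the path from nxt to node, closed by nxt, is a cycle.
                return stack[stack.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = dfs(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(graph):       # snapshot: dfs may add leaf nodes to graph
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None

consistent = [("A", "B"), ("B", "C"), ("A", "C")]
contradictory = [("A", "B"), ("B", "C"), ("C", "A")]
print(find_cycle(consistent))
print(find_cycle(contradictory))
```

A cycle check like this only flags that a contradiction exists. Quantifying how inconsistent a set of answers is, or repairing it, needs further design choices that this sketch does not attempt.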


Temporal Entailment Pretraining for Clinical Language Models over EHR Data

Tanaka, Tatsunori, Zheng, Fi, Sato, Kai, Li, Zhifeng, Zhang, Yuanyun, Li, Shi

arXiv.org Artificial Intelligence

Clinical language models have achieved strong performance on downstream tasks by pretraining on domain-specific corpora such as discharge summaries and medical notes. However, most approaches treat the electronic health record as a static document, neglecting the temporally evolving and causally entwined nature of patient trajectories. In this paper, we introduce a novel temporal entailment pretraining objective for language models in the clinical domain. Our method formulates EHR segments as temporally ordered sentence pairs and trains the model to determine whether a later state is entailed by, contradictory to, or neutral with respect to an earlier state. Through this temporally structured pretraining task, models learn to perform latent clinical reasoning over time, improving their ability to generalize across forecasting and diagnosis tasks. We pretrain on a large corpus derived from MIMIC-IV and demonstrate state-of-the-art results on temporal clinical QA, early warning prediction, and disease progression modeling.
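The pair construction described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the `Segment` type, the integer `time` field, and the `max_gap` parameter are hypothetical, and the abstract does not say how the entailment, contradiction, or neutral labels are obtained, so the sketch produces unlabeled pairs only.

```python
from dataclasses import dataclass
from itertools import combinations

# The three target classes named in the abstract. How labels are assigned
# is not specified there, so this sketch leaves the pairs unlabeled.
LABELS = ("entailment", "contradiction", "neutral")

@dataclass(frozen=True)
class Segment:
    time: int   # hypothetical timestamp, e.g. hours since admission
    text: str   # hypothetical free-text description of the patient state

def temporal_pairs(segments, max_gap=None):
    """Return (earlier_text, later_text) pairs in temporal order.

    max_gap is an assumed knob that drops pairs further apart than the
    given number of time units; None keeps every strictly ordered pair.
    """
    ordered = sorted(segments, key=lambda s: s.time)
    pairs = []
    for earlier, later in combinations(ordered, 2):
        if earlier.time == later.time:
            continue  # no temporal order between simultaneous segments
        if max_gap is not None and later.time - earlier.time > max_gap:
            continue
        pairs.append((earlier.text, later.text))
    return pairs

notes = [
    Segment(3, "Patient afebrile, antibiotics continued."),
    Segment(1, "Patient admitted with fever."),
    Segment(2, "Blood cultures drawn, antibiotics started."),
]
for premise, hypothesis in temporal_pairs(notes, max_gap=1):
    print(premise, "->", hypothesis)
```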


Cross-Examiner: Evaluating Consistency of Large Language Model-Generated Explanations

Villa, Danielle, Chang, Maria, Murugesan, Keerthiram, Uceda-Sosa, Rosario, Ramamurthy, Karthikeyan Natesan

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are often asked to explain their outputs to enhance accuracy and transparency. However, evidence suggests that these explanations can misrepresent the models' true reasoning processes. One effective way to identify inaccuracies or omissions in these explanations is through consistency checking, which typically involves asking follow-up questions. This paper introduces cross-examiner, a new method for generating follow-up questions based on a model's explanation of an initial question. Our method combines symbolic information extraction with language model-driven question generation, resulting in better follow-up questions than those produced by LLMs alone. Additionally, this approach is more flexible than other methods and can generate a wider variety of follow-up questions.


FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models

Marinescu, Radu, Bhattacharjya, Debarun, Lee, Junkyu, Tchrakian, Tigran, Cano, Javier Carnerero, Hou, Yufang, Daly, Elizabeth, Pascale, Alessandra

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated vast capabilities on generative tasks in recent years, yet they struggle with guaranteeing the factual correctness of the generated content. This makes these models unreliable in realistic situations where factually accurate responses are expected. In this paper, we propose FactReasoner, a new factuality assessor that relies on probabilistic reasoning to assess the factuality of a long-form generated response. Specifically, FactReasoner decomposes the response into atomic units, retrieves relevant contexts for them from an external knowledge source, and constructs a joint probability distribution over the atoms and contexts using probabilistic encodings of the logical relationships (entailment, contradiction) between the textual utterances corresponding to the atoms and contexts. FactReasoner then computes the posterior probability of whether atomic units in the response are supported by the retrieved contexts. Our experiments on labeled and unlabeled benchmark datasets demonstrate clearly that FactReasoner improves considerably over state-of-the-art prompt-based approaches in terms of both factual precision and recall.
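As a rough intuition for how entailment and contradiction evidence might be combined into a posterior for one atom, here is a toy calculation. It is not FactReasoner's model, whose joint distribution over atoms and contexts the abstract only outlines. The sketch assumes every retrieved context is taken as true, that each relation carries a hypothetical confidence score, and that relations act as independent factors.

```python
def atom_posterior(prior, relations):
    """Toy posterior that an atom is true, given relations to retrieved contexts.

    prior     : assumed prior probability that the atom is true
    relations : list of (kind, confidence) with kind in {"entail", "contradict"}
                and confidence in [0, 1], e.g. a score from an NLI model

    Assumption of this sketch (not taken from the paper): an entailing context
    multiplies the "true" weight by its confidence and the "false" weight by
    one minus its confidence; a contradicting context does the reverse.
    """
    weight_true, weight_false = prior, 1.0 - prior
    for kind, confidence in relations:
        if kind == "entail":
            weight_true *= confidence
            weight_false *= 1.0 - confidence
        elif kind == "contradict":
            weight_true *= 1.0 - confidence
            weight_false *= confidence
        else:
            raise ValueError(f"unknown relation kind: {kind}")
    total = weight_true + weight_false
    if total == 0.0:
        raise ValueError("evidence is fully contradictory; posterior undefined")
    return weight_true / total

# One strong supporting context and one weaker contradicting context.
print(atom_posterior(0.5, [("entail", 0.9), ("contradict", 0.6)]))
```

With no relations the function returns the prior unchanged, which is a useful sanity check for any scheme of this kind.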


NGQA: A Nutritional Graph Question Answering Benchmark for Personalized Health-aware Nutritional Reasoning

Zhang, Zheyuan, Li, Yiyang, Le, Nhi Ha Lan, Wang, Zehong, Ma, Tianyi, Galassi, Vincent, Murugesan, Keerthiram, Moniz, Nuno, Geyer, Werner, Chawla, Nitesh V, Zhang, Chuxu, Ye, Yanfang

arXiv.org Artificial Intelligence

Diet plays a critical role in human health, yet tailoring dietary reasoning to individual health conditions remains a major challenge. Nutrition Question Answering (QA) has emerged as a popular method for addressing this problem. However, current research faces two critical limitations. On one hand, the absence of datasets involving user-specific medical information severely limits personalization. This challenge is further compounded by the wide variability in individual health needs. On the other hand, while large language models (LLMs), a popular solution for this task, demonstrate strong reasoning abilities, they struggle with the domain-specific complexities of personalized healthy dietary reasoning, and existing benchmarks fail to capture these challenges. To address these gaps, we introduce the Nutritional Graph Question Answering (NGQA) benchmark, the first graph question answering dataset designed for personalized nutritional health reasoning. NGQA leverages data from the National Health and Nutrition Examination Survey (NHANES) and the Food and Nutrient Database for Dietary Studies (FNDDS) to evaluate whether a food is healthy for a specific user, supported by explanations of the key contributing nutrients. The benchmark incorporates three question complexity settings and evaluates reasoning across three downstream tasks. Extensive experiments with LLM backbones and baseline models demonstrate that the NGQA benchmark effectively challenges existing models. In sum, NGQA addresses a critical real-world problem while advancing GraphQA research with a novel domain-specific benchmark.


PREDICT: Preference Reasoning by Evaluating Decomposed preferences Inferred from Candidate Trajectories

Aroca-Ouellette, Stephane, Mackraz, Natalie, Theobald, Barry-John, Metcalf, Katherine

arXiv.org Artificial Intelligence

Accommodating human preferences is essential for creating AI agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs to infer preferences from user interactions, but they often produce broad and generic preferences, failing to capture the unique and individualized nature of human preferences. This paper introduces PREDICT, a method designed to enhance the precision and adaptability of inferring preferences. PREDICT incorporates three key elements: (1) iterative refinement of inferred preferences, (2) decomposition of preferences into constituent components, and (3) validation of preferences across multiple trajectories. We evaluate PREDICT on two distinct environments: a gridworld setting and a new text-domain environment (PLUME).


Rethinking Semantic Parsing for Large Language Models: Enhancing LLM Performance with Semantic Hints

An, Kaikai, Si, Shuzheng, Hu, Helan, Zhao, Haozhe, Wang, Yuchi, Guo, Qingyan, Chang, Baobao

arXiv.org Artificial Intelligence

Semantic Parsing aims to capture the meaning of a sentence and convert it into a logical, structured form. Previous studies show that semantic parsing enhances the performance of smaller models (e.g., BERT) on downstream tasks. However, it remains unclear whether the improvements extend similarly to LLMs. In this paper, our empirical findings reveal that, unlike smaller models, directly adding semantic parsing results into LLMs reduces their performance. To overcome this, we propose SENSE, a novel prompting approach that embeds semantic hints within the prompt. Experiments show that SENSE consistently improves LLMs' performance across various tasks, highlighting the potential of integrating semantic information to improve LLM capabilities.