contradict
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
Axiomatisation for an asynchronous epistemic logic with sending and receiving messages
Balbiani, Philippe, van Ditmarsch, Hans, Lerouvillois, Clara
We investigate a public announcement logic for asynchronous public announcements wherein the sending of the announcements by the environment is separated from the reception of the announcements by the individual agents. Both come with different modalities. In the logical semantics, formulas are interpreted in a world of a Kripke model but given a history of prior announcements and receptions of announcements that already happened. An axiomatisation AA for such a logic has been given in prior work, for the formulas that are valid when interpreted in the Kripke model before any such announcements have taken place. This axiomatisation is a reduction system wherein one can show that every formula is equivalent to a purely epistemic formula without dynamic modalities for announcements and receptions. We propose a generalisation AA* of this axiomatisation, for the formulas that are valid when interpreted in the Kripke model given any history of prior announcements and receptions of announcements. It does not extend the axiomatisation AA, for example it is no longer valid that nobody has received any announcement. Unlike AA, this axiomatisation AA* is infinitary and it is not a reduction system.
- Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (3 more...)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
Existing LLMs Are Not Self-Consistent For Simple Tasks
Lin, Zhenru, Tao, Jiawen, Yuan, Yang, Yao, Andrew Chi-Chih
Large Language Models (LLMs) have grown increasingly powerful, yet ensuring their decisions remain transparent and trustworthy requires self-consistency -- no contradictions in their internal reasoning. Our study reveals that even on simple tasks, such as comparing points on a line or a plane, or reasoning in a family tree, all smaller models are highly inconsistent, and even state-of-the-art models like DeepSeek-R1 and GPT-o4-mini are not fully self-consistent. To quantify and mitigate these inconsistencies, we introduce inconsistency metrics and propose two automated methods -- a graph-based and an energy-based approach. While these fixes provide partial improvements, they also highlight the complexity and importance of self-consistency in building more reliable and interpretable AI. The code and data are available at https://github.com/scorpio-nova/llm-self-consistency.
- Europe > Austria > Vienna (0.14)
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > Washington (0.04)
- (6 more...)
Temporal Entailment Pretraining for Clinical Language Models over EHR Data
Tanaka, Tatsunori, Zheng, Fi, Sato, Kai, Li, Zhifeng, Zhang, Yuanyun, Li, Shi
Clinical language models have achieved strong performance on downstream tasks by pretraining on domain specific corpora such as discharge summaries and medical notes. However, most approaches treat the electronic health record as a static document, neglecting the temporally-evolving and causally entwined nature of patient trajectories. In this paper, we introduce a novel temporal entailment pretraining objective for language models in the clinical domain. Our method formulates EHR segments as temporally ordered sentence pairs and trains the model to determine whether a later state is entailed by, contradictory to, or neutral with respect to an earlier state. Through this temporally structured pretraining task, models learn to perform latent clinical reasoning over time, improving their ability to generalize across forecasting and diagnosis tasks. We pretrain on a large corpus derived from MIMIC IV and demonstrate state of the art results on temporal clinical QA, early warning prediction, and disease progression modeling.
Cross-Examiner: Evaluating Consistency of Large Language Model-Generated Explanations
Villa, Danielle, Chang, Maria, Murugesan, Keerthiram, Uceda-Sosa, Rosario, Ramamurthy, Karthikeyan Natesan
Large Language Models (LLMs) are often asked to explain their outputs to enhance accuracy and transparency. However, evidence suggests that these explanations can misrepresent the models' true reasoning processes. One effective way to identify inaccuracies or omissions in these explanations is through consistency checking, which typically involves asking follow-up questions. This paper introduces, cross-examiner, a new method for generating follow-up questions based on a model's explanation of an initial question. Our method combines symbolic information extraction with language model-driven question generation, resulting in better follow-up questions than those produced by LLMs alone. Additionally, this approach is more flexible than other methods and can generate a wider variety of follow-up questions.
FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models
Marinescu, Radu, Bhattacharjya, Debarun, Lee, Junkyu, Tchrakian, Tigran, Cano, Javier Carnerero, Hou, Yufang, Daly, Elizabeth, Pascale, Alessandra
Large language models (LLMs) have demonstrated vast capabilities on generative tasks in recent years, yet they struggle with guaranteeing the factual correctness of the generated content. This makes these models unreliable in realistic situations where factually accurate responses are expected. In this paper, we propose FactReasoner, a new factuality assessor that relies on probabilistic reasoning to assess the factuality of a long-form generated response. Specifically, FactReasoner decomposes the response into atomic units, retrieves relevant contexts for them from an external knowledge source, and constructs a joint probability distribution over the atoms and contexts using probabilistic encodings of the logical relationships (entailment, contradiction) between the textual utterances corresponding to the atoms and contexts. FactReasoner then computes the posterior probability of whether atomic units in the response are supported by the retrieved contexts. Our experiments on labeled and unlabeled benchmark datasets demonstrate clearly that FactReasoner improves considerably over state-of-the-art prompt-based approaches in terms of both factual precision and recall.
- North America > United States (0.67)
- South America > Brazil (0.28)
- Europe > Middle East > Malta (0.14)
- (2 more...)
- Leisure & Entertainment (0.68)
- Media (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.66)
NGQA: A Nutritional Graph Question Answering Benchmark for Personalized Health-aware Nutritional Reasoning
Zhang, Zheyuan, Li, Yiyang, Le, Nhi Ha Lan, Wang, Zehong, Ma, Tianyi, Galassi, Vincent, Murugesan, Keerthiram, Moniz, Nuno, Geyer, Werner, Chawla, Nitesh V, Zhang, Chuxu, Ye, Yanfang
Diet plays a critical role in human health, yet tailoring dietary reasoning to individual health conditions remains a major challenge. Nutrition Question Answering (QA) has emerged as a popular method for addressing this problem. However, current research faces two critical limitations. On one hand, the absence of datasets involving user-specific medical information severely limits \textit{personalization}. This challenge is further compounded by the wide variability in individual health needs. On the other hand, while large language models (LLMs), a popular solution for this task, demonstrate strong reasoning abilities, they struggle with the domain-specific complexities of personalized healthy dietary reasoning, and existing benchmarks fail to capture these challenges. To address these gaps, we introduce the Nutritional Graph Question Answering (NGQA) benchmark, the first graph question answering dataset designed for personalized nutritional health reasoning. NGQA leverages data from the National Health and Nutrition Examination Survey (NHANES) and the Food and Nutrient Database for Dietary Studies (FNDDS) to evaluate whether a food is healthy for a specific user, supported by explanations of the key contributing nutrients. The benchmark incorporates three question complexity settings and evaluates reasoning across three downstream tasks. Extensive experiments with LLM backbones and baseline models demonstrate that the NGQA benchmark effectively challenges existing models. In sum, NGQA addresses a critical real-world problem while advancing GraphQA research with a novel domain-specific benchmark.
- Asia > Middle East > Republic of Türkiye (0.05)
- Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
- North America > United States > Connecticut (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
PREDICT: Preference Reasoning by Evaluating Decomposed preferences Inferred from Candidate Trajectories
Aroca-Ouellette, Stephane, Mackraz, Natalie, Theobald, Barry-John, Metcalf, Katherine
Accommodating human preferences is essential for creating AI agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs to infer preferences from user interactions, but they often produce broad and generic preferences, failing to capture the unique and individualized nature of human preferences. This paper introduces PREDICT, a method designed to enhance the precision and adaptability of inferring preferences. PREDICT incorporates three key elements: (1) iterative refinement of inferred preferences, (2) decomposition of preferences into constituent components, and (3) validation of preferences across multiple trajectories. We evaluate PREDICT on two distinct environments: a gridworld setting and a new text-domain environment (PLUME).
- North America > United States > Colorado > Boulder County > Boulder (0.14)
- Europe > Germany (0.04)
- North America > United States > New York (0.04)
- (7 more...)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Rethinking Semantic Parsing for Large Language Models: Enhancing LLM Performance with Semantic Hints
An, Kaikai, Si, Shuzheng, Hu, Helan, Zhao, Haozhe, Wang, Yuchi, Guo, Qingyan, Chang, Baobao
Semantic Parsing aims to capture the meaning of a sentence and convert it into a logical, structured form. Previous studies show that semantic parsing enhances the performance of smaller models (e.g., BERT) on downstream tasks. However, it remains unclear whether the improvements extend similarly to LLMs. In this paper, our empirical findings reveal that, unlike smaller models, directly adding semantic parsing results into LLMs reduces their performance. To overcome this, we propose SENSE, a novel prompting approach that embeds semantic hints within the prompt. Experiments show that SENSE consistently improves LLMs' performance across various tasks, highlighting the potential of integrating semantic information to improve LLM capabilities.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Singapore (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.98)