presupposition
LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High
Sieker, Judith, Lachenmaier, Clara, Zarrieß, Sina
This paper examines how LLMs handle false presuppositions and whether certain linguistic factors influence their responses to falsely presupposed content. Presuppositions subtly introduce information as given, making them highly effective at embedding disputable or false information. This raises concerns that LLMs, like humans, may fail to detect and correct misleading assumptions introduced as false presuppositions, even when the stakes of misinformation are high. Using a systematic approach based on linguistic presupposition analysis, we investigate the conditions under which LLMs are more or less likely to adopt or reject false presuppositions. Focusing on political contexts, we examine how factors such as linguistic construction, political party, and scenario probability affect the recognition of false presuppositions. We conduct experiments with a newly created dataset and examine three LLMs: OpenAI's GPT-4o, Meta's Llama-3-8B, and Mistral AI's Mistral-7B-v0.3. Our results show that the models struggle to recognize false presuppositions, with performance varying by condition. This study highlights that linguistic presupposition analysis is a valuable tool for uncovering the reinforcement of political misinformation in LLM responses.
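To make this kind of evaluation concrete, here is a minimal sketch of how a false-presupposition probe can be constructed and scored. The claim, the `query_model` stub, and the keyword-based rejection check are illustrative assumptions, not the authors' actual pipeline.

```python
# Sketch of probing a model with a false presupposition (invented claim;
# `query_model` is a hypothetical stub, not any provider's real API).
def query_model(prompt: str) -> str:
    # Stand-in response for illustration; replace with a real model call.
    return "There is no evidence that Senator Smith resigned."

FALSE_CLAIM = "Senator Smith resigned over the budget scandal"  # invented

prompts = {
    # Direct question: the claim is asserted, so it can be challenged.
    "direct": f"Is it true that {FALSE_CLAIM}?",
    # Loaded question: a wh-question presupposes the claim instead.
    "loaded": "Why did Senator Smith resign over the budget scandal?",
}

def rejects_presupposition(response: str) -> bool:
    """Crude keyword heuristic; studies typically use human or LLM judges."""
    markers = ("did not resign", "didn't resign", "no evidence", "not true")
    return any(m in response.lower() for m in markers)

for condition, prompt in prompts.items():
    print(condition, rejects_presupposition(query_model(prompt)))
```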
P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication
Oram, Sneha, Bhattacharyya, Pushpak
Although explainability and interpretability have received significant attention in artificial intelligence (AI) and natural language processing (NLP) for mental health, reasoning has not been examined in the same depth. Addressing this gap is essential to bridge NLP and mental health through interpretable and reasoning-capable AI systems. To this end, we investigate the pragmatic reasoning capability of large language models (LLMs) in the mental health domain. We introduce the PRiMH dataset and propose pragmatic reasoning tasks in mental health that target the pragmatic phenomena of implicature and presupposition. In particular, we formulate two tasks on implicature and one task on presupposition. To benchmark the dataset and the proposed tasks, we consider four models: Llama3.1, Mistral, MentaLLaMA, and Qwen. The experimental results suggest that Mistral and Qwen show substantial reasoning abilities in the domain. We then study the behavior of MentaLLaMA on the proposed reasoning tasks with the attention-rollout mechanism. In addition, we propose three StiPRompts to study the stigma around mental health with the state-of-the-art LLMs GPT-4o-mini, DeepSeek-chat, and Claude-3.5-haiku. Our findings show that Claude-3.5-haiku deals with stigma more responsibly than the other two LLMs.
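Attention rollout aggregates per-layer attention maps into token-to-token influence scores by recursively multiplying them, folding in the residual connection as an identity term. A minimal NumPy sketch of that computation follows; the random matrices stand in for a real model's head-averaged attention, and this is a generic reconstruction of the technique, not the authors' code.

```python
import numpy as np

def attention_rollout(attentions: list[np.ndarray]) -> np.ndarray:
    """Aggregate per-layer attention maps (each [seq, seq], averaged over
    heads) into token-to-token influence scores via attention rollout."""
    seq_len = attentions[0].shape[0]
    rollout = np.eye(seq_len)
    for layer_attn in attentions:
        # Fold in the residual connection, then renormalize rows.
        attn = 0.5 * layer_attn + 0.5 * np.eye(seq_len)
        attn = attn / attn.sum(axis=-1, keepdims=True)
        rollout = attn @ rollout
    return rollout

# Toy usage: random row-stochastic "attention" for 4 tokens, 3 layers.
rng = np.random.default_rng(0)
layers = [rng.dirichlet(np.ones(4), size=4) for _ in range(3)]
print(attention_rollout(layers).round(3))
```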
Safer in Translation? Presupposition Robustness in Indic Languages
Palnitkar, Aadi, Suresh, Arjun, Rajesh, Rishi, Puli, Puneet
More and more people are turning to large language models (LLMs) for healthcare advice and consultation, making it important to gauge the efficacy and accuracy of LLM responses to such queries. While existing medical benchmarks seek to accomplish this task, they are almost universally in English, leaving a notable gap in the literature on multilingual LLM evaluation. In this work, we help address this gap with Cancer-Myth-Indic, an Indic-language benchmark built by translating a 500-item subset of Cancer-Myth, sampled evenly across its original categories, into five under-served but widely used languages of the subcontinent (500 items per language; 2,500 translated items in total). Native-speaker translators followed a style guide for preserving implicit presuppositions in translation; items feature false presuppositions relating to cancer. We evaluate several popular LLMs under this presupposition stress test.
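The even sampling across categories described above amounts to a simple stratified draw. The sketch below assumes each benchmark item carries a `category` field; the field name and item shape are illustrative, since the original schema is not given.

```python
import random
from collections import defaultdict

def sample_evenly(items: list[dict], n_total: int) -> list[dict]:
    """Draw n_total items with an equal share from each category
    (assumes every item is a dict with a 'category' key)."""
    by_cat = defaultdict(list)
    for item in items:
        by_cat[item["category"]].append(item)
    per_cat = n_total // len(by_cat)
    sample = []
    for cat_items in by_cat.values():
        sample.extend(random.sample(cat_items, per_cat))
    return sample

# Toy usage: 6 items over 3 categories, drawing 2 per category.
items = [{"category": c, "id": i} for i, c in enumerate("AABBCC")]
print(len(sample_evenly(items, 6)))  # 6
```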
Implementing a Logical Inference System for Japanese Comparatives
Mikami, Yosuke, Matsuoka, Daiki, Yanaka, Hitomi
Natural Language Inference (NLI) involving comparatives is challenging because it requires understanding quantities and comparative relations expressed by sentences. While some approaches leverage Large Language Models (LLMs), we focus on logic-based approaches grounded in compositional semantics, which are promising for robust handling of numerical and logical expressions. Previous studies along these lines have proposed logical inference systems for English comparatives. However, it has been pointed out that there are several morphological and semantic differences between Japanese and English comparatives. These differences make it difficult to apply such systems directly to Japanese comparatives. To address this gap, this study proposes ccg-jcomp, a logical inference system for Japanese comparatives based on compositional semantics. We evaluate the proposed system on a Japanese NLI dataset containing comparative expressions. We demonstrate the effectiveness of our system by comparing its accuracy with that of existing LLMs.
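For readers unfamiliar with degree semantics, the standard compositional analysis that such logic-based systems build on treats a gradable adjective as a relation between individuals and degrees. The fragment below is a textbook-style illustration, not necessarily the exact semantics implemented in ccg-jcomp.

```latex
% "Ken is taller than Jun" is true iff Ken's maximal height degree
% exceeds Jun's (illustrative degree-semantics analysis):
\[
  \max\{d \mid \mathrm{tall}(\mathrm{ken}, d)\} >
  \max\{d \mid \mathrm{tall}(\mathrm{jun}, d)\}
\]
% From this, a logical inference system can validate entailments such as:
% "Ken is taller than Jun" + "Jun is 170 cm tall"
%   => "Ken is more than 170 cm tall".
```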
Measuring Sycophancy of Language Models in Multi-turn Dialogues
Hong, Jiseung, Byun, Grace, Kim, Seungone, Shu, Kai, Choi, Jinho D.
Large Language Models (LLMs) are expected to provide helpful and harmless responses, yet they often exhibit sycophancy, conforming to user beliefs regardless of factual accuracy or ethical soundness. Prior research on sycophancy has primarily focused on single-turn factual correctness, overlooking the dynamics of real-world interactions. In this work, we introduce SYCON Bench, a novel benchmark for evaluating sycophantic behavior in multi-turn, free-form conversational settings. Our benchmark measures how quickly a model conforms to the user (Turn of Flip) and how frequently it shifts its stance under sustained user pressure (Number of Flip). Applying SYCON Bench to 17 LLMs across three real-world scenarios, we find that sycophancy remains a prevalent failure mode. Our analysis shows that alignment tuning amplifies sycophantic behavior, whereas model scaling and reasoning optimization strengthen the model's ability to resist undesirable user views. Reasoning models generally outperform instruction-tuned models but often fail when they over-index on logical exposition instead of directly addressing the user's underlying beliefs. Finally, we evaluate four additional prompting strategies and demonstrate that adopting a third-person perspective reduces sycophancy by up to 63.8% in the debate scenario. We release our code and data at https://github.com/JiseungHong/SYCON-Bench.
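Given the definitions above, both metrics reduce to simple functions of a per-turn stance trace. A minimal sketch, assuming stances have already been labeled per turn (the labeling itself, via judge models or annotators, is the hard part and is elided here):

```python
def turn_of_flip(stances: list[bool]) -> int | None:
    """First turn (1-indexed) at which the model abandons its initial
    stance; None if it never flips. True = holds original position."""
    for turn, holds in enumerate(stances, start=1):
        if not holds:
            return turn
    return None

def number_of_flips(stances: list[bool]) -> int:
    """Count stance reversals across consecutive turns."""
    return sum(a != b for a, b in zip(stances, stances[1:]))

# Toy trace: model resists for 2 turns, caves, then recovers once.
trace = [True, True, False, False, True]
print(turn_of_flip(trace), number_of_flips(trace))  # 3 2
```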
Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions
Lachenmaier, Clara, Sieker, Judith, Zarrieß, Sina
Communication among humans relies on conversational grounding, allowing interlocutors to reach mutual understanding even when they do not have perfect knowledge and must resolve discrepancies in each other's beliefs. This paper investigates how large language models (LLMs) manage common ground in cases where they (don't) possess knowledge, focusing on facts in the political domain, where the risk of misinformation and grounding failure is high. We examine the ability of LLMs to answer direct knowledge questions and loaded questions that presuppose misinformation. We evaluate whether loaded questions lead LLMs to engage in active grounding and correct false user beliefs, in connection with their level of knowledge and their political bias. Our findings highlight significant challenges in LLMs' ability to engage in grounding and reject false user beliefs, raising concerns about their role in mitigating misinformation in political discourse.
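A direct/loaded question pair makes the contrast concrete, and responses can be coded by how they treat the presupposition. The pair and the three-way coding below are invented illustrations, not items or labels from the authors' dataset.

```python
# Invented direct/loaded pair; not an item from the authors' dataset.
direct = "Did Country X ban solar panels in 2020?"
loaded = "When did Country X ban solar panels?"  # presupposes the ban

def grounding_act(answer: str) -> str:
    """Toy three-way coding of how an answer treats the presupposition;
    real evaluations rely on judge models or human annotators."""
    lowered = answer.lower()
    if any(m in lowered for m in ("never banned", "did not ban", "no ban")):
        return "reject"   # active grounding: corrects the false belief
    if "banned" in lowered:
        return "accept"   # adopts the false presupposition
    return "evade"        # neither confirms nor corrects

print(grounding_act("Country X never banned solar panels."))  # reject
```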
They want to pretend not to understand: The Limits of Current LLMs in Interpreting Implicit Content of Political Discourse
Paci, Walter, Panunzi, Alessandro, Pezzelle, Sandro
Implicit content plays a crucial role in political discourse, where speakers systematically employ pragmatic strategies such as implicatures and presuppositions to influence their audiences. Large Language Models (LLMs) have demonstrated strong performance in tasks requiring complex semantic and pragmatic understanding, highlighting their potential for detecting and explaining the meaning of implicit content. However, their ability to do this within political discourse remains largely underexplored. Leveraging, for the first time, the large IMPAQTS corpus, which comprises Italian political speeches annotated for manipulative implicit content, we propose methods to test the effectiveness of LLMs on this challenging problem. Through a multiple-choice task and an open-ended generation task, we demonstrate that all tested models struggle to interpret presuppositions and implicatures. We conclude that current LLMs lack the key pragmatic capabilities necessary for accurately interpreting highly implicit language, such as that found in political discourse. At the same time, we highlight promising trends and future directions for enhancing model performance. We release our data and code at https://github.com/WalterPaci/IMPAQTS-PID.
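A multiple-choice probe of implicit content can be scored as in the sketch below. The utterance, the answer options, and the `query_model` stub are invented for illustration and are not drawn from IMPAQTS.

```python
# Sketch of a multiple-choice probe for implicit content; the utterance
# and options are invented, and `query_model` is a hypothetical stub.
def query_model(prompt: str) -> str:
    return "A"  # stand-in; replace with a real model call

utterance = "We will restore order in our streets."  # invented
options = {
    "A": "The speaker presupposes that order has been lost.",  # target
    "B": "The speaker asserts that streets will be widened.",
    "C": "The utterance carries no implicit content.",
}
prompt = (
    f"Utterance: {utterance}\n"
    "Which option best describes its implicit content?\n"
    + "\n".join(f"{k}. {v}" for k, v in options.items())
    + "\nAnswer with a single letter."
)
answer = query_model(prompt)
print("correct" if answer.strip().upper().startswith("A") else "wrong")
```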
Conjoined Predication and Scalar Implicature
Magri (2016) has discussed two puzzles raised by conjunction. While the first puzzle has not been resolved, a solution to the second puzzle has been proposed by Magri. The first puzzle conceals an interrelationship between quantification, collective/concurrent interpretation, and contextual updates, aspects of which have not been explored. In brief, the puzzle is that certain variants of sentences such as # Some Italians come from a warm country involving conjunction, as in # (Only) Some Italians come from a warm country and are blond, remain odd despite the fact that no alternative seems to trigger the mismatching scalar implicature. In this paper, we offer a conceptual analysis of Magri's first puzzle, by first presenting it in the context of the theory in which it arises. This paper proposes that the oddness arises due to the collective-concurrent interpretation of the conjunctive predicate, as underlined in # (Only) Some Italians come from a warm country and are blond, which ends up giving rise to an indirect contextual contradiction. It is suggested that the generation of scalar implicatures may have pragmatically governed facets not fully conditioned by accounts of exhaustification-based grammatical licensing of scalar implicatures.
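For readers outside the exhaustification literature, the standard derivation of why the plain sentence is odd runs as follows; this is a textbook-style reconstruction, not Magri's full formal system.

```latex
% Why "# Some Italians come from a warm country" is odd (sketch).
% Let p = [[some Italians come from a warm country]] and
%     q = [[all Italians come from a warm country]] (the scalar alternative).
% Exhaustification negates the stronger alternative:
\[
  \mathrm{exh}(p) \;=\; p \wedge \neg q
\]
% Common ground C: all Italians come from the same country, so p and q
% are contextually equivalent, and the strengthened meaning contradicts C:
\[
  C \models p \leftrightarrow q
  \quad\Rightarrow\quad
  C \wedge \mathrm{exh}(p) \models q \wedge \neg q
\]
% Magri's first puzzle: the conjoined variant stays odd even though no
% alternative appears to generate such a mismatching implicature.
```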
Let's CONFER: A Dataset for Evaluating Natural Language Inference Models on CONditional InFERence and Presupposition
Azin, Tara, Dumitrescu, Daniel, Inkpen, Diana, Singh, Raj
Natural Language Inference (NLI) is the task of determining whether a sentence pair represents entailment, contradiction, or a neutral relationship. While NLI models perform well on many inference tasks, their ability to handle fine-grained pragmatic inferences, particularly presupposition in conditionals, remains underexplored. In this study, we introduce CONFER, a novel dataset designed to evaluate how NLI models process inference in conditional sentences. We assess the performance of four NLI models, including two pre-trained models, to examine their generalization to conditional reasoning. Additionally, we evaluate Large Language Models (LLMs), including GPT-4o, LLaMA, Gemma, and DeepSeek-R1, in zero-shot and few-shot prompting settings to analyze their ability to infer presuppositions with and without prior context. Our findings indicate that NLI models struggle with presuppositional reasoning in conditionals, and fine-tuning on existing NLI datasets does not necessarily improve their performance.
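Presuppositions triggered inside a conditional's antecedent typically project out of the conditional, which is what such probes test. The item, labels, and `query_model` stub below are invented illustrations of the zero-shot versus few-shot setup, not examples from CONFER.

```python
# Presupposition projection out of a conditional (invented item):
# "If Mary's dog barks again, the neighbors will complain." presupposes
# that Mary has a dog and that it has barked before, whether or not the
# antecedent turns out to be true.
def query_model(prompt: str) -> str:
    return "entailment"  # stand-in; replace with a real model call

premise = "If Mary's dog barks again, the neighbors will complain."
hypothesis = "Mary's dog has barked before."

zero_shot = (
    f"Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Answer entailment, contradiction, or neutral."
)
# Few-shot: prepend a worked example before the test item.
few_shot = (
    "Premise: If John's car breaks down again, he will sell it.\n"
    "Hypothesis: John's car has broken down before.\nAnswer: entailment\n\n"
    + zero_shot
)
print(query_model(few_shot))
```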
ChatGPT for President! Presupposed content in politicians versus GPT-generated texts
Garassino, Davide, Brocca, Nicola, Masia, Viviana
This study examines ChatGPT-4's capability to replicate linguistic strategies used in political discourse, focusing on its potential for manipulative language generation. As large language models become increasingly popular for text generation, concerns have grown regarding their role in spreading fake news and propaganda. This research compares real political speeches with those generated by ChatGPT, emphasizing presuppositions (a rhetorical device that subtly influences audiences by packaging some content as already known at the moment of utterance, thus swaying opinions without explicit argumentation). Using a corpus-based pragmatic analysis, this study assesses how well ChatGPT can mimic these persuasive strategies. The findings reveal that although ChatGPT-generated texts contain many manipulative presuppositions, key differences emerge in their frequency, form, and function compared with those of politicians. For instance, ChatGPT often relies on change-of-state verbs used in fixed phrases, whereas politicians use presupposition triggers in more varied and creative ways. Such differences, however, are challenging to detect with the naked eye, underscoring the potential risks posed by large language models in political and public discourse.
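Comparing trigger use across two corpora reduces to counting occurrences per trigger class and normalizing by corpus size. A rough sketch follows, with a tiny illustrative trigger lexicon; the real study's trigger inventory and counts are not reproduced here.

```python
from collections import Counter

# Tiny illustrative lexicon of presupposition triggers by class; a real
# analysis would use a much larger, linguistically curated inventory.
TRIGGERS = {
    "change_of_state": ["stop", "start", "continue", "return"],
    "iterative": ["again", "anymore", "still"],
    "definite": ["the"],
}

def trigger_rates(tokens: list[str]) -> dict[str, float]:
    """Relative frequency (per 1,000 tokens) of each trigger class."""
    counts = Counter()
    for tok in tokens:
        for cls, words in TRIGGERS.items():
            if tok.lower() in words:
                counts[cls] += 1
    return {cls: 1000 * n / len(tokens) for cls, n in counts.items()}

speech = "We will stop the decline and start again".split()
print(trigger_rates(speech))
```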