ambiguity type
From Disagreement to Understanding: The Case for Ambiguity Detection in NLI
Jayaweera, Chathuri, Dorr, Bonnie J.
This position paper argues that annotation disagreement in Natural Language Inference (NLI) is not mere noise but often reflects meaningful variation, especially when triggered by ambiguity in the premise or hypothesis. While underspecified guidelines and annotator behavior contribute to variation, content-based ambiguity provides a process-independent signal of divergent human perspectives. We call for a shift toward ambiguity-aware NLI that first identifies ambiguous input pairs, classifies their types, and only then proceeds to inference. To support this shift, we present a framework that incorporates ambiguity detection and classification prior to inference. We also introduce a unified taxonomy that synthesizes existing taxonomies, illustrates key subtypes with examples, and motivates targeted detection methods that better align models with human interpretation. Although current resources lack datasets explicitly annotated for ambiguity and subtypes, this gap presents an opportunity: by developing new annotated resources and exploring unsupervised approaches to ambiguity detection, we enable more robust, explainable, and human-aligned NLI systems.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Florida > Alachua County > Gainesville (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
LLM-based ambiguity detection in natural language instructions for collaborative surgical robots
Davila, Ana, Colan, Jacinto, Hasegawa, Yasuhisa
Ambiguity in natural language instructions poses significant risks in safety-critical human-robot interaction, particularly in domains such as surgery. To address this, we propose a framework that uses Large Language Models (LLMs) for ambiguity detection specifically designed for collaborative surgical scenarios. Our method employs an ensemble of LLM evaluators, each configured with distinct prompting techniques to identify linguistic, contextual, procedural, and critical ambiguities. A chain-of-thought evaluator is included to systematically analyze instruction structure for potential issues. Individual evaluator assessments are synthesized through conformal prediction, which yields non-conformity scores based on comparison to a labeled calibration dataset. Evaluating Llama 3.2 11B and Gemma 3 12B, we observed classification accuracy exceeding 60% in differentiating ambiguous from unambiguous surgical instructions. Our approach improves the safety and reliability of human-robot collaboration in surgery by offering a mechanism to identify potentially ambiguous instructions before robot action.
- Asia > Japan (0.05)
- North America > United States (0.04)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)
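The conformal-prediction step mentioned in the surgical-instruction abstract above can be illustrated with a minimal split-conformal sketch. This is not the authors' implementation: the mean-based ensemble, the `1 - p(label)` non-conformity score, the function names, and the toy calibration data are all assumptions made for illustration.

```python
import math

LABELS = ("ambiguous", "unambiguous")

def ensemble_prob_ambiguous(evaluator_scores):
    """Average the per-evaluator probabilities that an instruction is ambiguous (assumed aggregation)."""
    return sum(evaluator_scores) / len(evaluator_scores)

def nonconformity(p_ambiguous, label):
    """Assumed non-conformity score: 1 minus the ensemble probability of the given label."""
    p_label = p_ambiguous if label == "ambiguous" else 1.0 - p_ambiguous
    return 1.0 - p_label

def calibrate(calibration, alpha=0.1):
    """Return the conformal threshold from labeled (evaluator_scores, true_label) pairs."""
    scores = sorted(
        nonconformity(ensemble_prob_ambiguous(s), y) for s, y in calibration
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: every label is accepted.
        return float("inf")
    return scores[rank - 1]

def prediction_set(evaluator_scores, qhat):
    """Labels whose non-conformity does not exceed the calibrated threshold."""
    p = ensemble_prob_ambiguous(evaluator_scores)
    return [label for label in LABELS if nonconformity(p, label) <= qhat]

# Toy calibration set (made-up numbers): three evaluators per instruction.
calibration = [
    ([0.9, 0.8, 0.7], "ambiguous"),
    ([0.2, 0.1, 0.3], "unambiguous"),
    ([0.6, 0.7, 0.8], "ambiguous"),
    ([0.4, 0.3, 0.2], "unambiguous"),
]
qhat = calibrate(calibration, alpha=0.2)
print(prediction_set([0.95, 0.9, 0.85], qhat))
print(prediction_set([0.1, 0.1, 0.1], qhat))
```

In this sketch, a prediction set containing a single label is a confident call, while an empty or two-label set could be treated as a cue to ask the operator for clarification before the robot acts.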
AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment
Ivanova, Anastasiia, Bakaeva, Eva, Volovikova, Zoya, Kovalev, Alexey K., Panov, Aleksandr I.
As a part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge for LLMs. Various methods for task ambiguity detection have been proposed. However, it is difficult to compare them because they are tested on different datasets and there is no universal benchmark. For this reason, we propose AmbiK (Ambiguous Tasks in Kitchen Environment), a fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the assistance of LLMs and is human-validated. It comprises 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. We hope that AmbiK will enable researchers to perform a unified comparison of ambiguity detection methods. AmbiK is available at https://github.com/cog-model/AmbiK-dataset.
- Asia > Russia (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Dominican Republic (0.04)
- Education (0.67)
- Consumer Products & Services > Food, Beverage, Tobacco & Cannabis > Beverages (0.48)
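The record structure described in the AmbiK abstract above can be pictured with a small data-structure sketch. The class and field names are hypothetical and do not reflect the dataset's actual schema or column names; the example record is invented. Only the three ambiguity types and the 1000-pair / 2000-task counts come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum

class AmbiguityType(Enum):
    # The three categories named in the abstract.
    HUMAN_PREFERENCES = "Human Preferences"
    COMMON_SENSE_KNOWLEDGE = "Common Sense Knowledge"
    SAFETY = "Safety"

@dataclass
class TaskPair:
    """One ambiguous task and its unambiguous counterpart (hypothetical field names)."""
    environment: str
    ambiguous_task: str
    unambiguous_task: str
    ambiguity_type: AmbiguityType
    clarifying_question: str
    clarifying_answer: str
    user_intent: str
    task_plan: list[str]

def count_tasks(pairs):
    """Each pair contributes two tasks: the ambiguous one and its counterpart."""
    return 2 * len(pairs)

# Invented example record, for illustration only.
example = TaskPair(
    environment="a kitchen with a glass bowl, a metal bowl, and a microwave",
    ambiguous_task="Heat the soup in a bowl in the microwave.",
    unambiguous_task="Heat the soup in the glass bowl in the microwave.",
    ambiguity_type=AmbiguityType.SAFETY,
    clarifying_question="Which bowl should I use?",
    clarifying_answer="The glass bowl.",
    user_intent="Warm the soup without damaging the microwave.",
    task_plan=["pour soup into glass bowl", "place bowl in microwave", "heat"],
)
print(count_tasks([example] * 1000))
```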
Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding
Subbiah, Melanie, Mishra, Akankshya, Kim, Grace, Tang, Liyan, Durrett, Greg, McKeown, Kathleen
Determining faithfulness of a claim to a source document is an important problem across many domains. This task is generally treated as a binary judgment of whether the claim is supported or unsupported in relation to the source. In many cases, though, whether a claim is supported can be ambiguous. For instance, it may depend on making inferences from given evidence, and different people can reasonably interpret the claim as either supported or unsupported based on their agreement with those inferences. Forcing binary labels upon such claims lowers the reliability of evaluation. In this work, we reframe the task to manage the subjectivity involved with factuality judgments of ambiguous claims. We introduce LLM-generated edits of summaries as a method of providing a nuanced evaluation of claims: how much does a summary need to be edited to be unambiguous? Whether a claim gets rewritten and how much it changes can be used as an automatic evaluation metric, the Ambiguity Rewrite Metric (ARM), with a much richer feedback signal than a binary judgment of faithfulness. We focus on the area of narrative summarization as it is particularly rife with ambiguity and subjective interpretation. We show that ARM produces a 21% absolute improvement in annotator agreement on claim faithfulness, indicating that subjectivity is reduced.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Canada (0.04)
- Asia > Singapore (0.04)
A Taxonomy of Ambiguity Types for NLP
Li, Margaret Y., Liu, Alisa, Wu, Zhaofeng, Smith, Noah A.
Ambiguity is a critical component of language that allows for more effective communication between speakers, but is often ignored in NLP. Recent work suggests that NLP systems may struggle to grasp certain elements of human language understanding because they may not handle ambiguities at the level that humans naturally do in communication. Additionally, different types of ambiguity may serve different purposes and require different approaches for resolution, and we aim to investigate how language models' abilities vary across types. We propose a taxonomy of ambiguity types as seen in English to facilitate NLP analysis. Our taxonomy can help make meaningful splits in language ambiguity data, allowing for more fine-grained assessments of both datasets and model performance.
- Europe > Norway (0.06)
- North America > United States > New York (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.05)
Zero and Few-shot Semantic Parsing with Ambiguous Inputs
Stengel-Eskin, Elias, Rawlins, Kyle, Van Durme, Benjamin
Despite the ubiquity of ambiguity in natural language, it is often ignored or deliberately removed in semantic parsing tasks, which generally assume that a given surface form has only one correct logical form. We attempt to address this shortcoming by introducing AmP, a framework, dataset, and challenge for parsing with linguistic ambiguity. We define templates and generate data for five well-documented linguistic ambiguities. Using AmP, we investigate how several few-shot semantic parsing systems handle ambiguity, introducing three new metrics. We find that large pre-trained models perform poorly at capturing the distribution of possible meanings without deliberate instruction. However, models are able to capture the distribution well when ambiguity is attested in their inputs. These results motivate a call for ambiguity to be explicitly included in semantic parsing, and promote considering the distribution of possible outputs when evaluating semantic parsing systems.
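The idea of evaluating a parser against a distribution of possible meanings, rather than a single gold logical form, can be sketched as follows. This is not a reproduction of the three metrics the abstract mentions: total variation distance and reading coverage are stand-ins chosen for illustration, and the logical-form strings and the 50/50 reference distribution are made up.

```python
from collections import Counter

def empirical_distribution(samples):
    """Relative frequency of each logical form among sampled parses."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {form: count / total for form, count in counts.items()}

def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def coverage(samples, valid_forms):
    """Fraction of the valid readings that appear at least once in the samples."""
    return len(set(samples) & set(valid_forms)) / len(valid_forms)

# Hypothetical two-reading input with an assumed uniform reference distribution.
LF_A = "saw(man, boy) & with(saw, telescope)"
LF_B = "saw(man, boy) & has(boy, telescope)"
reference = {LF_A: 0.5, LF_B: 0.5}

# Made-up samples: one parser mixes readings, the other always picks one.
mixed_samples = [LF_A] * 8 + [LF_B] * 2
single_samples = [LF_A] * 10

for name, samples in [("mixed", mixed_samples), ("single", single_samples)]:
    tv = total_variation(empirical_distribution(samples), reference)
    print(name, round(tv, 3), coverage(samples, reference))
```

A parser that always returns one reading can still be scored as correct under single-answer evaluation, while a distribution-aware measure like the one above penalizes it for missing the second reading.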