Reference Resolution



J-ORA: A Framework and Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception

Atuhurra, Jesse, Kamigaito, Hidetaka, Watanabe, Taro, Yoshino, Koichiro

arXiv.org Artificial Intelligence

We introduce J-ORA, a novel multimodal dataset that bridges the gap in robot perception by providing detailed object attribute annotations within Japanese human-robot dialogue scenarios. J-ORA is designed to support three critical perception tasks (object identification, reference resolution, and next-action prediction) by leveraging a comprehensive template of attributes (e.g., category, color, shape, size, material, and spatial relations). Extensive evaluations with both proprietary and open-source Vision Language Models (VLMs) reveal that incorporating detailed object attributes substantially improves multimodal perception performance compared to settings without object attributes. Despite this improvement, a gap remains between proprietary and open-source VLMs. In addition, our analysis of object affordances demonstrates varying abilities in understanding object functionality and contextual relationships across different VLMs. These findings underscore the importance of rich, context-sensitive attribute annotations in advancing robot perception in dynamic environments. See the project page at https://jatuhurrra.github.io/J-ORA/.
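The attribute template described above lends itself to a simple structured record. The sketch below is illustrative only: the fields follow the attributes named in the abstract (category, color, shape, size, material, spatial relations), but the class layout and the prompt-flattening helper are assumptions, not the actual J-ORA schema.

```python
# Illustrative sketch only: field names follow the attribute template described
# above; this is NOT the actual J-ORA annotation schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectAnnotation:
    object_id: str
    category: str                 # e.g. "mug"
    color: str                    # e.g. "red"
    shape: str                    # e.g. "cylindrical"
    size: str                     # e.g. "small"
    material: str                 # e.g. "ceramic"
    spatial_relations: List[str] = field(default_factory=list)  # e.g. ["left_of:plate"]

def to_prompt(obj: ObjectAnnotation) -> str:
    """Flatten the attribute record into text that could be prepended to a VLM prompt."""
    rels = ", ".join(obj.spatial_relations) or "none"
    return (f"{obj.object_id}: a {obj.size} {obj.color} {obj.material} "
            f"{obj.category} ({obj.shape}); relations: {rels}")

print(to_prompt(ObjectAnnotation("obj_1", "mug", "red", "cylindrical",
                                 "small", "ceramic", ["left_of:plate"])))
```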


Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures

Inadumi, Shun, Ueda, Nobuhiro, Yoshino, Koichiro

arXiv.org Artificial Intelligence

Multimodal reference resolution, including phrase grounding, aims to understand the semantic relations between mentions and real-world objects. Phrase grounding between images and their captions is a well-established task. In contrast, for real-world applications, it is essential to integrate textual and multimodal reference resolution to unravel the reference relations within dialogue, especially in handling ambiguities caused by pronouns and ellipses. This paper presents a framework that unifies textual and multimodal reference resolution by mapping mention embeddings to object embeddings and selecting mentions or objects based on their similarity. Our experiments show that learning textual reference resolution, such as coreference resolution and predicate-argument structure analysis, positively affects performance in multimodal reference resolution. In particular, our model with coreference resolution performs better in pronoun phrase grounding than representative models for this task, MDETR and GLIP. Our qualitative analysis demonstrates that incorporating textual reference relations strengthens the confidence scores between mentions, including pronouns and predicates, and objects, which can reduce the ambiguities that arise in visually grounded dialogues.
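The core mechanism described here, mapping mention embeddings into the object-embedding space and selecting candidates by similarity, can be sketched in a few lines. The dimensions, the single linear projection, and all names below are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch: project a mention embedding into the object-embedding space
# and rank candidate objects by cosine similarity. Dimensions are invented.
import torch
import torch.nn.functional as F

class MentionToObjectResolver(torch.nn.Module):
    def __init__(self, mention_dim: int = 768, object_dim: int = 512):
        super().__init__()
        self.project = torch.nn.Linear(mention_dim, object_dim)

    def forward(self, mention_emb: torch.Tensor, object_embs: torch.Tensor) -> torch.Tensor:
        """mention_emb: (mention_dim,); object_embs: (num_objects, object_dim).
        Returns cosine-similarity scores over the candidate objects."""
        q = F.normalize(self.project(mention_emb), dim=-1)
        k = F.normalize(object_embs, dim=-1)
        return k @ q  # shape (num_objects,)

resolver = MentionToObjectResolver()
scores = resolver(torch.randn(768), torch.randn(5, 512))
print("predicted object index:", scores.argmax().item())
```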


I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue

Ghaleb, Esam, Khaertdinov, Bulat, Özyürek, Aslı, Fernández, Raquel

arXiv.org Artificial Intelligence

In face-to-face interaction, we use multiple modalities, including speech and gestures, to communicate information and resolve references to objects. However, how representational co-speech gestures refer to objects remains understudied from a computational perspective. In this work, we address this gap by introducing a multimodal reference resolution task centred on representational gestures, while simultaneously tackling the challenge of learning robust gesture embeddings. We propose a self-supervised pre-training approach to gesture representation learning that grounds body movements in spoken language. Our experiments show that the learned embeddings align with expert annotations and have significant predictive power. Moreover, reference resolution accuracy further improves when (1) using multimodal gesture representations, even when speech is unavailable at inference time, and (2) leveraging dialogue history. Overall, our findings highlight the complementary roles of gesture and speech in reference resolution, offering a step towards more naturalistic models of human-machine interaction.
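One common way to realize "grounding body movements in spoken language" as self-supervised pre-training is a symmetric contrastive (InfoNCE) objective over paired gesture and speech embeddings. The sketch below shows that generic objective; the encoders, dimensions, and temperature are assumptions and not necessarily the paper's exact setup.

```python
# Generic CLIP-style contrastive objective between time-aligned gesture and
# speech embeddings; a sketch of one possible grounding objective, not the
# paper's exact implementation.
import torch
import torch.nn.functional as F

def gesture_speech_contrastive_loss(gesture_emb, speech_emb, temperature=0.07):
    """gesture_emb, speech_emb: (batch, dim) embeddings of paired segments."""
    g = F.normalize(gesture_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    logits = g @ s.t() / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(g.size(0))          # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = gesture_speech_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```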



ReALM: Reference Resolution As Language Modeling

Moniz, Joel Ruben Antony, Krishnan, Soundarya, Ozyildirim, Melis, Saraf, Prathamesh, Ates, Halim Cagri, Zhang, Yuan, Yu, Hong, Rajshree, Nidhi

arXiv.org Artificial Intelligence

Reference resolution is an important problem, one that is essential to understanding and successfully handling context of different kinds. This context includes both previous turns and context that pertains to non-conversational entities, such as entities on the user's screen or those running in the background. While LLMs have been shown to be extremely powerful for a variety of tasks, their use in reference resolution, particularly for non-conversational entities, remains underutilized. This paper demonstrates how LLMs can be used to create an extremely effective system to resolve references of various types, by showing how reference resolution can be converted into a language modeling problem, despite involving forms of entities, like those on screen, that are not traditionally conducive to being reduced to a text-only modality. We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.
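The key move, casting reference resolution as language modeling, amounts to serializing the candidate entities (including non-conversational, on-screen ones) into text and letting the LLM pick among them. The entity tag format and prompt wording below are invented for this sketch; they are not ReALM's actual encoding.

```python
# Rough illustration of "reference resolution as language modeling":
# on-screen entities are flattened into text so an LLM can answer with entity ids.
# The tag format and prompt wording are invented for illustration.
screen_entities = [
    {"id": 1, "type": "phone_number", "text": "555-0100"},
    {"id": 2, "type": "business",     "text": "Joe's Pizza"},
    {"id": 3, "type": "address",      "text": "12 Main St"},
]

def build_prompt(utterance: str) -> str:
    lines = [f"[{e['id']}] ({e['type']}) {e['text']}" for e in screen_entities]
    return ("Entities on screen:\n" + "\n".join(lines) +
            f"\nUser: {utterance}\nWhich entity ids does the user refer to?")

print(build_prompt("call the second one"))
# The resulting prompt is fed to an LLM, so resolving the reference
# becomes ordinary text generation over the serialized entities.
```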


J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution

Ueda, Nobuhiro, Habe, Hideko, Matsui, Yoko, Yuguchi, Akishige, Kawano, Seiya, Kawanishi, Yasutomo, Kurohashi, Sadao, Yoshino, Koichiro

arXiv.org Artificial Intelligence

Understanding expressions that refer to the physical world is crucial for human-assisting systems operating in the real world, such as robots that must perform actions expected by users. In real-world reference resolution, a system must ground the verbal information that appears in user interactions to the visual information observed in egocentric views. To this end, we propose a multimodal reference resolution task and construct a Japanese Conversation dataset for Real-world Reference Resolution (J-CRe3). Our dataset contains egocentric video and dialogue audio of real-world conversations between two people acting as a master and an assistant robot at home. The dataset is annotated with crossmodal tags between phrases in the utterances and the object bounding boxes in the video frames. These tags include indirect reference relations, such as predicate-argument structures and bridging references, as well as direct reference relations. We also constructed an experimental model and clarified the challenges in multimodal reference resolution tasks.
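A crossmodal tag of the kind described above can be pictured as a record that links a phrase span in an utterance to an object bounding box in a video frame. The field names and values below are hypothetical and do not reflect J-CRe3's released format.

```python
# Hypothetical crossmodal annotation record: a phrase in a Japanese utterance
# linked to a bounding box in a video frame. Field names and values are invented
# for illustration; this is not J-CRe3's actual data format.
annotation = {
    "utterance_id": "dlg03_turn12",
    "phrase": {"text": "そのコップ", "char_span": [4, 9]},  # "that cup"; span is invented
    "relation": "direct",   # could also be indirect: predicate-argument, bridging
    "frame_id": 345,
    "bbox": {"x": 210, "y": 118, "w": 64, "h": 80, "label": "cup"},
}
print(annotation["phrase"]["text"], "->", annotation["bbox"]["label"])
```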


Reference Resolution and Context Change in Multimodal Situated Dialogue for Exploring Data Visualizations

Kumar, Abhinav, Di Eugenio, Barbara, Bhattacharya, Abari, Aurisano, Jillian, Johnson, Andrew

arXiv.org Artificial Intelligence

Reference resolution, which aims to identify entities being referred to by a speaker, is more complex in real-world settings: new referents may be created by processes the agents engage in and/or be salient only because they belong to the shared physical setting. Our focus is on resolving references to visualizations on a large screen display in multimodal dialogue; crucially, reference resolution is directly involved in the process of creating new visualizations. We describe our annotations for user references to visualizations appearing on a large screen via language and hand gesture, and also new entity establishment, which results from executing the user request to create a new visualization. We also describe our reference resolution pipeline, which relies on an information-state architecture to maintain dialogue context. We report results on detecting and resolving references, the effectiveness of contextual information on the model, and under-specified requests for creating visualizations. We also experiment with a conventional CRF and deep learning / transformer models (BiLSTM-CRF and BERT-CRF) for tagging references in user utterance text. Our results show that transfer learning significantly boosts the performance of the deep learning methods, although the CRF still outperforms them, suggesting that conventional methods may generalize better for low-resource data.
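Tagging references in utterance text, as in the CRF versus BiLSTM-CRF/BERT-CRF comparison above, reduces to sequence labeling over tokens. The toy sketch below uses BIO labels and the sklearn-crfsuite package as one possible toolkit; the features, label set, and miniature training data are invented and are not the authors' setup.

```python
# Toy sketch of tagging referring expressions in utterance text with a linear-chain CRF.
# Features, labels, and the tiny training set are invented for illustration.
import sklearn_crfsuite

def features(tokens, i):
    return {"word": tokens[i].lower(), "is_first": i == 0,
            "prev": tokens[i - 1].lower() if i else "<s>"}

sents = [["show", "that", "bar", "chart"], ["make", "it", "bigger"]]
labels = [["O", "B-REF", "I-REF", "I-REF"], ["O", "B-REF", "O"]]
X = [[features(s, i) for i in range(len(s))] for s in sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```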


An Annotated Corpus of Reference Resolution for Interpreting Common Grounding

Udagawa, Takuma, Aizawa, Akiko

arXiv.org Artificial Intelligence

Common grounding is the process of creating, repairing and updating mutual understandings, which is a fundamental aspect of natural language conversation. However, interpreting the process of common grounding is a challenging task, especially under continuous and partially observable context, where complex ambiguity, uncertainty, partial understandings and misunderstandings are introduced. Interpretation becomes even more challenging when we deal with dialogue systems, which still have limited capabilities in natural language understanding and generation. To address this problem, we consider reference resolution as the central subtask of common grounding and propose a new resource to study its intermediate process. Based on a simple and general annotation schema, we collected a total of 40,172 referring expressions in 5,191 dialogues curated from an existing corpus, along with multiple judgements of referent interpretations. We show that our annotation is highly reliable, captures the complexity of common grounding through a natural degree of reasonable disagreements, and allows for more detailed and quantitative analyses of common grounding strategies. Finally, we demonstrate the advantages of our annotation for interpreting, analyzing and improving common grounding in baseline dialogue systems.
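A referring expression annotated with multiple referent judgements can be represented as a simple record, with agreement computed over annotator pairs. The schema, entity ids, and numbers below are invented for illustration and are not the released corpus format.

```python
# Hypothetical referring-expression record with multiple referent judgements,
# plus a simple pairwise agreement score. Schema and values are invented.
from itertools import combinations

record = {
    "dialogue_id": "d_0042",
    "referring_expression": "the two dark dots on the left",
    "judgements": [  # each annotator selects a set of entity ids
        {"annotator": "a1", "referents": {3, 7}},
        {"annotator": "a2", "referents": {3, 7}},
        {"annotator": "a3", "referents": {3}},
    ],
}

def pairwise_agreement(judgements):
    sets = [j["referents"] for j in judgements]
    pairs = list(combinations(sets, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

print(pairwise_agreement(record["judgements"]))  # 1 of 3 pairs agree -> ~0.33
```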


On the Winograd Schema: Situating Language Understanding in the Data-Information-Knowledge Continuum

Saba, Walid (Astound.ai)

AAAI Conferences

The Winograd Schema (WS) challenge has been proposed as an alternative to the Turing Test as a test for machine intelligence. In this paper we ‘situate’ the WS challenge in the data-information-knowledge continuum, suggesting in the process what makes a good WS. Subsequently, we will argue that the WS is but a special case of a more general phenomenon in language understanding, namely the phenomenon of the ‘missing text’. In particular, we will argue that what we usually call thinking in the process of language understanding almost always involves discovering some missing text: text that is rarely explicitly stated but is implicitly assumed as shared background knowledge. As such, we suggest extending the WS challenge to include other linguistic phenomena that also involve discovering the ‘missing text’, such as metonymy, quantifier scope, lexical disambiguation, and copredication, to name a few.
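To make the ‘missing text’ point concrete, the snippet below encodes the classic trophy/suitcase Winograd Schema as data: swapping a single word flips the referent of the pronoun, and neither reading can be recovered from the sentence alone without unstated background knowledge.

```python
# The classic trophy/suitcase Winograd Schema, shown as a simple data structure.
# Resolving "it" requires background knowledge (big things don't fit in small
# containers) that is never stated in the sentence itself.
schema = {
    "sentence": "The trophy doesn't fit in the suitcase because it is too {word}.",
    "pronoun": "it",
    "candidates": ["the trophy", "the suitcase"],
    "variants": {"big": "the trophy", "small": "the suitcase"},
}

for word, answer in schema["variants"].items():
    print(schema["sentence"].format(word=word), "->", answer)
```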