Multimodal Contextualized Semantic Parsing from Speech

Voas, Jordan; Mooney, Raymond; Harwath, David

Jun-10-2024–arXiv.org Artificial Intelligence

We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents' contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent's knowledge with new information, mirroring the complexity of human communication. We develop the VG-SPICE dataset, crafted to challenge agents with visual scene graph construction from spoken conversational exchanges, highlighting speech and visual data integration. We also present the Audio-Vision Dialogue Scene Parser (AViD-SP) developed for use on VG-SPICE. These innovations aim to improve multimodal information processing and integration. Both the VG-SPICE dataset and the AViD-SP model are publicly available.
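
As a toy illustration of the task described above (updating an agent's prior context with newly parsed information), the sketch below merges a parsed update into a small scene graph. It is only a sketch under stated assumptions: the `SceneGraph` class, the `apply_update` function, and the example utterance are hypothetical names chosen for illustration, and they do not reflect the VG-SPICE data format or how AViD-SP works.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical toy context: nodes with attribute sets, plus
    (subject, relation, object) edges. Not the VG-SPICE format."""
    nodes: dict[str, set[str]] = field(default_factory=dict)
    edges: set[tuple[str, str, str]] = field(default_factory=set)


def apply_update(context: SceneGraph, update: SceneGraph) -> SceneGraph:
    """Return a new context that merges a parsed update into the prior one."""
    # Copy the prior nodes so the original context is left unchanged.
    merged_nodes = {name: set(attrs) for name, attrs in context.nodes.items()}
    for name, attrs in update.nodes.items():
        merged_nodes.setdefault(name, set()).update(attrs)

    merged_edges = set(context.edges) | set(update.edges)

    # Ensure every edge endpoint also exists as a node.
    for subj, _relation, obj in merged_edges:
        merged_nodes.setdefault(subj, set())
        merged_nodes.setdefault(obj, set())

    return SceneGraph(merged_nodes, merged_edges)


# Prior context: the agent already knows about a brown dog.
prior = SceneGraph(nodes={"dog": {"brown"}})

# Assumed parser output for an utterance such as
# "The dog is sitting on a red couch." (hand-written here, not model output).
parsed = SceneGraph(
    nodes={"dog": {"sitting"}, "couch": {"red"}},
    edges={("dog", "on", "couch")},
)

updated = apply_update(prior, parsed)
print(sorted(updated.nodes["dog"]))
print(sorted(updated.edges))
```

Returning a new graph rather than mutating the prior one keeps each conversational turn's context available for inspection, which suits the interpretable, step-by-step updating the abstract emphasizes.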

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Grammars & Parsing (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)
