Multimodal Contextualized Semantic Parsing from Speech
Voas, Jordan; Mooney, Raymond; Harwath, David
arXiv.org Artificial Intelligence
We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents' contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent's knowledge with new information, mirroring the complexity of human communication. We develop the VG-SPICE dataset, crafted to challenge agents with visual scene graph construction from spoken conversational exchanges, highlighting speech and visual data integration. We also present the Audio-Vision Dialogue Scene Parser (AViD-SP) developed for use on VG-SPICE. These innovations aim to improve multimodal information processing and integration. Both the VG-SPICE dataset and the AViD-SP model are publicly available.
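To make the task described above concrete, the following is a minimal sketch of how a prior context (a scene graph) might be updated with information parsed from a new utterance. It is not the AViD-SP model or the VG-SPICE data format: the `SceneGraph` class, the `apply_update` function, and the operation names (`add_node`, `add_attr`, `add_edge`) are hypothetical and chosen only for illustration, and the speech and vision parsing step is assumed to have already produced the list of operations.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical context representation: objects, attributes, relations."""
    # object name -> set of attribute strings
    nodes: dict[str, set[str]] = field(default_factory=dict)
    # (subject, relation, object) triples
    edges: set[tuple[str, str, str]] = field(default_factory=set)


def apply_update(graph: SceneGraph, ops: list[tuple]) -> SceneGraph:
    """Return a new graph: the prior context plus the parsed operations."""
    # Copy so the prior context is left unchanged.
    nodes = {name: set(attrs) for name, attrs in graph.nodes.items()}
    edges = set(graph.edges)
    for op in ops:
        kind = op[0]
        if kind == "add_node":
            nodes.setdefault(op[1], set())
        elif kind == "add_attr":
            nodes.setdefault(op[1], set()).add(op[2])
        elif kind == "add_edge":
            _, subj, rel, obj = op
            # Make sure both endpoints exist before linking them.
            nodes.setdefault(subj, set())
            nodes.setdefault(obj, set())
            edges.add((subj, rel, obj))
        else:
            raise ValueError(f"unknown operation: {kind}")
    return SceneGraph(nodes, edges)


# Prior context: the agent already knows about a wooden table.
prior = SceneGraph(nodes={"table": {"wooden"}})

# Assumed parser output for an utterance such as
# "there is a red cup on the table" (illustrative only).
parsed_ops = [
    ("add_node", "cup"),
    ("add_attr", "cup", "red"),
    ("add_edge", "cup", "on", "table"),
]

updated = apply_update(prior, parsed_ops)
```

Returning a new graph rather than mutating the prior one keeps each conversational turn's context available for comparison, which is one possible design choice among several.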
Jun-10-2024
- Country:
- Asia
- China > Hong Kong (0.04)
- Middle East
- Jordan (0.04)
- Republic of Türkiye > Karaman Province
- Karaman (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Myanmar > Tanintharyi Region
- Dawei (0.04)
- Singapore (0.04)
- Europe
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Ireland (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- North America
- Dominican Republic (0.04)
- United States > Texas
- Travis County > Austin (0.04)
- Genre:
- Research Report (0.82)
- Technology: