What's in the Box? Reasoning about Unseen Objects from Multimodal Cues
Lance Ying, Daniel Xu, Alicia Zhang, Katherine M. Collins, Max H. Siegel, Joshua B. Tenenbaum
arXiv.org Artificial Intelligence
People regularly make inferences about objects they cannot see by flexibly integrating information from multiple sources: auditory and visual cues, language, and prior beliefs and knowledge about the scene. How are we able to integrate so many sources of information to make sense of the world around us, even when we lack direct perceptual access? In this work, we propose a neurosymbolic model that uses neural networks to parse open-ended multimodal inputs and then applies a Bayesian model to integrate the different sources of information and evaluate competing hypotheses. We evaluate our model with a novel object-guessing game, "What's in the Box?", in which humans and models watch a video clip of an experimenter shaking boxes and then try to guess the objects inside. Through a human experiment, we show that our model correlates strongly with human judgments, whereas unimodal ablated models and large multimodal neural baselines show poor correlation.
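The abstract does not specify the model's implementation, but the Bayesian integration step it describes can be illustrated with a minimal sketch: given per-cue likelihoods (which, in the paper's setup, would be produced by neural parsers of audio, video, and language), a posterior over object hypotheses follows from Bayes' rule under a conditional-independence assumption. All hypothesis names and likelihood values below are hypothetical, purely for illustration.

```python
import math

# Hypothetical object hypotheses for what might be inside a shaken box.
hypotheses = ["marble", "key", "coin"]

# Hypothetical per-cue likelihoods P(cue | object); in the paper's pipeline,
# these would come from neural networks parsing multimodal inputs.
likelihoods = {
    "audio":    {"marble": 0.6, "key": 0.3, "coin": 0.1},
    "language": {"marble": 0.2, "key": 0.5, "coin": 0.3},
}

# Uniform prior over hypotheses, for simplicity.
prior = {h: 1 / len(hypotheses) for h in hypotheses}

def posterior(prior, likelihoods):
    """Combine cues via Bayes' rule, assuming the cues are conditionally
    independent given the hypothesis, then normalize."""
    unnorm = {
        h: prior[h] * math.prod(cue[h] for cue in likelihoods.values())
        for h in prior
    }
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

post = posterior(prior, likelihoods)
print(max(post, key=post.get))  # → "key" (0.3 * 0.5 beats 0.6 * 0.2)
```

The conditional-independence assumption keeps the integration tractable: each new modality simply multiplies in another likelihood term, which is what lets such a model flexibly combine however many cues are available.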
Jun-18-2025