Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents
Neural Information Processing Systems
There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) that take assistive actions toward a user's goal or query (e.g., "Where did I leave my keys?"). In this work, we consider the important complementary problem of inferring that goal from multimodal contextual observations. Solving this "goal inference" problem holds the promise of eliminating the effort needed to interact with such an agent. This work focuses on creating WAGIBench, a strong benchmark to measure progress in solving this problem using vision-language models (VLMs). Given the limited prior work in this area, we collected a novel dataset comprising 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We validate that human performance exceeds model performance, achieving 93% multiple-choice accuracy compared with 84% for the best-performing VLM. Generative benchmark results that evaluate several families of modern VLMs show that larger models perform significantly better on the task, yet remain far from practical usefulness, as they produce relevant goals only 55% of the time. Through a modality ablation, we show that models benefit from extra information in relevant modalities with minimal performance degradation from irrelevant modalities.
Jun-15-2026, 16:19:15 GMT
- Country:
- Asia > China (0.28)
- North America > United States (0.28)
- Genre:
- Research Report
- Experimental Study (0.93)
- New Finding (0.92)
- Industry:
- Leisure & Entertainment (1.00)
- Health & Medicine > Consumer Health (1.00)
- Education (1.00)
- Energy (0.93)
- Consumer Products & Services (0.93)
- Media > Music (0.67)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Representation & Reasoning (1.00)
- Natural Language > Large Language Model (0.95)
- Machine Learning > Neural Networks (0.68)
- Speech > Speech Recognition (0.67)