
SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing

Chen, Mingfei, Cui, Zijun, Liu, Xiulong, Xiang, Jinlin, Zheng, Caleb, Li, Jingyuan, Shlizerman, Eli

arXiv.org Artificial Intelligence

3D spatial reasoning in dynamic, audio-visual environments is a cornerstone of human cognition yet remains largely unexplored by existing Audio-Visual Large Language Models (AV-LLMs) and benchmarks, which predominantly focus on static or 2D scenes. We introduce SAVVY-Bench, the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio. SAVVY-Bench comprises thousands of relationships involving static and moving objects, and requires fine-grained temporal grounding, consistent 3D localization, and multi-modal annotation. To tackle this challenge, we propose SAVVY, a novel training-free reasoning pipeline that consists of two stages: (i) Egocentric Spatial Tracks Estimation, which leverages AV-LLMs as well as other audio-visual methods to track the trajectories of key objects related to the query using both visual and spatial audio cues, and (ii) Dynamic Global Map Construction, which aggregates multi-modal queried object trajectories and converts them into a unified global dynamic map. Using the constructed map, a final QA answer is obtained through a coordinate transformation that aligns the global map with the queried viewpoint. Empirical evaluation demonstrates that SAVVY substantially enhances performance of state-of-the-art AV-LLMs, setting a new standard and stage for approaching dynamic 3D spatial reasoning in AV-LLMs.
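The final step of the pipeline, answering a query by aligning the global map with the queried viewpoint, amounts to a rigid coordinate transformation. A minimal sketch of that idea follows; the function names, the yaw-only rotation, and the left/right decision rule are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def to_viewpoint_frame(points_world, vp_pos, vp_yaw):
    """Transform world-frame 3D points into a queried viewpoint's frame.

    points_world: (N, 3) object positions on the global map.
    vp_pos: (3,) viewpoint position in world coordinates.
    vp_yaw: viewpoint heading in radians (rotation about the vertical
            z-axis; 0 means facing the +x direction).
    """
    c, s = np.cos(vp_yaw), np.sin(vp_yaw)
    # World-to-viewpoint rotation: the inverse (transpose) of the yaw rotation.
    R = np.array([[c,  s, 0.0],
                  [-s, c, 0.0],
                  [0.0, 0.0, 1.0]])
    return (np.asarray(points_world, dtype=float) - vp_pos) @ R.T

def left_or_right(point_world, vp_pos, vp_yaw):
    """Answer a directional query: is the object to the viewer's left or right?

    In this right-handed frame, +y in the viewpoint frame is the viewer's left.
    """
    _, y, _ = to_viewpoint_frame([point_world], vp_pos, vp_yaw)[0]
    return "left" if y > 0 else "right"
```

Once object trajectories are on the global map, the same transform can be applied at any queried timestamp and viewpoint, which is what makes the map reusable across questions.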


EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos

Li, Yuxuan, Veerabadran, Vijay, Iuzzolino, Michael L., Roads, Brett D., Celikyilmaz, Asli, Ridgeway, Karl

arXiv.org Artificial Intelligence

We introduce EgoToM, a new video question-answering benchmark that extends Theory-of-Mind (ToM) evaluation to egocentric domains. Using a causal ToM model, we generate multi-choice video QA instances for the Ego4D dataset to benchmark the ability to predict a camera wearer's goals, beliefs, and next actions. We study the performance of both humans and state-of-the-art multimodal large language models (MLLMs) on these three interconnected inference problems. Our evaluation shows that MLLMs achieve close to human-level accuracy on inferring goals from egocentric videos. However, MLLMs (including the largest ones we tested, with over 100B parameters) fall short of human performance when inferring the camera wearer's in-the-moment belief states and the future actions most consistent with the unseen video future. We believe that our results will shape the future design of an important class of egocentric digital assistants equipped with a reasonable model of the user's internal mental states.


HourVideo: 1-Hour Video-Language Understanding

Chandrasegaran, Keshigeyan, Gupta, Agrim, Hadzic, Lea M., Kota, Taran, He, Jimming, Eyzaguirre, Cristóbal, Durante, Zane, Li, Manling, Wu, Jiajun, Fei-Fei, Li

arXiv.org Artificial Intelligence

We present HourVideo, a benchmark dataset for hour-long video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve marginal improvements over random chance. In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities. Our benchmark, evaluation toolkit, prompts, and documentation are available at https://hourvideo.stanford.edu


Efficient In-Context Learning in Vision-Language Models for Egocentric Videos

Yu, Keunwoo Peter, Zhang, Zheyuan, Hu, Fengyuan, Chai, Joyce

arXiv.org Artificial Intelligence

Recent advancements in text-only large language models (LLMs) have highlighted the benefit of in-context learning for adapting to new tasks with a few demonstrations. However, extending in-context learning to large vision-language models (VLMs) using a huge amount of naturalistic vision-language data has shown limited success, particularly for egocentric videos, due to high data collection costs. We propose a novel training method, Efficient In-context Learning on Egocentric Videos (EILEV), which elicits in-context learning in VLMs for egocentric videos without requiring massive, naturalistic egocentric video datasets. EILEV involves architectural and training data adaptations to allow the model to process contexts interleaved with video clips and narrations, sampling of in-context examples from clusters of similar verbs and nouns, and use of data with skewed marginal distributions featuring a long tail of infrequent verbs and nouns, as well as homonyms and synonyms. Our evaluations show that EILEV-trained models outperform larger VLMs trained on a huge amount of naturalistic data in in-context learning. Furthermore, they can generalize not only to out-of-distribution but also to novel, rare egocentric videos and texts via in-context learning, demonstrating potential for applications requiring cost-effective training and rapid post-deployment adaptability. Our code and demo are available at https://github.com/yukw777/EILEV.
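The sampling step described above, drawing in-context demonstrations from clusters of similar verbs and nouns, can be sketched as follows. This is a rough stand-in, not EILEV's actual code: here clusters are formed by exact verb match, whereas real clusters would come from a taxonomy or embedding similarity, and the field names are invented for illustration:

```python
import random
from collections import defaultdict

def build_clusters(examples):
    """Group narrated clip examples by verb.

    Each example is assumed to be a dict with "verb", "noun", and
    "narration" keys (a hypothetical schema for illustration).
    """
    clusters = defaultdict(list)
    for ex in examples:
        clusters[ex["verb"]].append(ex)
    return clusters

def sample_context(clusters, verb, k, rng=random):
    """Pick up to k demonstrations from the queried verb's cluster,
    without replacement, to interleave with the query clip."""
    pool = clusters.get(verb, [])
    return rng.sample(pool, min(k, len(pool)))
```

Sampling demonstrations that share an action with the query is what lets the model pick up the task format from context rather than from massive naturalistic pretraining data.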


EgoBody: Human Body Shape, Motion and Social Interactions from Head-Mounted Devices

Zhang, Siwei, Ma, Qianli, Zhang, Yan, Qian, Zhiyin, Pollefeys, Marc, Bogo, Federica, Tang, Siyu

arXiv.org Artificial Intelligence

Understanding social interactions from first-person views is crucial for many applications, ranging from assistive robotics to AR/VR. A first step for reasoning about interactions is to understand human pose and shape. However, research in this area is currently hindered by the lack of data. Existing datasets are limited in terms of either size, annotations, ground-truth capture modalities or the diversity of interactions. We address this shortcoming by proposing EgoBody, a novel large-scale dataset for social interactions in complex 3D scenes. We employ Microsoft HoloLens2 headsets to record rich egocentric data streams (including RGB, depth, eye gaze, head and hand tracking). To obtain accurate 3D ground-truth, we calibrate the headset with a multi-Kinect rig and fit expressive SMPL-X body meshes to multi-view RGB-D frames, reconstructing 3D human poses and shapes relative to the scene. We collect 68 sequences, spanning diverse sociological interaction categories, and propose the first benchmark for 3D full-body pose and shape estimation from egocentric views. Our dataset and code will be available for research at https://sanweiliti.github.io/egobody/egobody.html.


Teaching AI to perceive the world through your eyes

#artificialintelligence

AI that understands the world from a first-person point of view could unlock a new era of immersive experiences, as devices like augmented reality (AR) glasses and virtual reality (VR) headsets become as useful in everyday life as smartphones. Imagine your AR device displaying exactly how to hold the sticks during a drum lesson, guiding you through a recipe, helping you find your lost keys, or recalling memories as holograms that come to life in front of you. To build these new technologies, we need to teach AI to understand and interact with the world like we do, from a first-person perspective -- commonly referred to in the research community as egocentric perception. Today's computer vision (CV) systems, however, typically learn from millions of photos and videos that are captured in third-person perspective, where the camera is just a spectator to the action. "Next-generation AI systems will need to learn from an entirely different kind of data -- videos that show the world from the center of the action, rather than the sidelines," says Kristen Grauman, lead research scientist at Facebook.


Facebook: Here comes the AI of the Metaverse

#artificialintelligence

To operate in augmented and virtual reality, Facebook believes artificial intelligence will need to develop an "egocentric perspective." To that end, the company on Thursday announced Ego4D, a data set of 2,792 hours of first-person video, and a set of benchmark tests for neural nets, designed to encourage the development of AI that is savvier about what it's like to move through virtual worlds from a first-person perspective. The project is a collaboration between Facebook Reality Labs and scholars from 13 research institutions, including academic institutions and research labs. The details are laid out in a paper lead-authored by Facebook's Kristen Grauman, "Ego4D: Around the World in 2.8K Hours of Egocentric Video." Grauman is a scientist with the company's Facebook AI Research unit.