
SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing

Chen, Mingfei, Cui, Zijun, Liu, Xiulong, Xiang, Jinlin, Zheng, Caleb, Li, Jingyuan, Shlizerman, Eli

arXiv.org Artificial Intelligence

3D spatial reasoning in dynamic, audio-visual environments is a cornerstone of human cognition yet remains largely unexplored by existing Audio-Visual Large Language Models (AV-LLMs) and benchmarks, which predominantly focus on static or 2D scenes. We introduce SAVVY-Bench, the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio. SAVVY-Bench comprises thousands of relationships involving static and moving objects, and requires fine-grained temporal grounding, consistent 3D localization, and multi-modal annotation. To tackle this challenge, we propose SAVVY, a novel training-free reasoning pipeline that consists of two stages: (i) Egocentric Spatial Tracks Estimation, which leverages AV-LLMs as well as other audio-visual methods to track the trajectories of key objects related to the query using both visual and spatial audio cues, and (ii) Dynamic Global Map Construction, which aggregates multi-modal queried object trajectories and converts them into a unified global dynamic map. Using the constructed map, a final QA answer is obtained through a coordinate transformation that aligns the global map with the queried viewpoint. Empirical evaluation demonstrates that SAVVY substantially enhances performance of state-of-the-art AV-LLMs, setting a new standard and stage for approaching dynamic 3D spatial reasoning in AV-LLMs.
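The final step of the pipeline, answering a query by aligning the global map with the queried viewpoint, amounts to a rigid coordinate transformation. A minimal sketch of that idea follows; the function names, the yaw-only rotation, and the left/right decision rule are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def to_viewpoint_frame(points_world, vp_pos, vp_yaw):
    """Transform world-frame 3D points into a queried viewpoint's frame.

    points_world: (N, 3) object positions on the global map.
    vp_pos: (3,) viewpoint position in world coordinates.
    vp_yaw: viewpoint heading in radians (rotation about the vertical
            z-axis; 0 means facing the +x direction).
    """
    c, s = np.cos(vp_yaw), np.sin(vp_yaw)
    # World-to-viewpoint rotation: the inverse (transpose) of the yaw rotation.
    R = np.array([[c,  s, 0.0],
                  [-s, c, 0.0],
                  [0.0, 0.0, 1.0]])
    return (np.asarray(points_world, dtype=float) - vp_pos) @ R.T

def left_or_right(point_world, vp_pos, vp_yaw):
    """Answer a directional query: is the object to the viewer's left or right?

    In this right-handed frame, +y in the viewpoint frame is the viewer's left.
    """
    _, y, _ = to_viewpoint_frame([point_world], vp_pos, vp_yaw)[0]
    return "left" if y > 0 else "right"
```

Once object trajectories are on the global map, the same transform can be applied at any queried timestamp and viewpoint, which is what makes the map reusable across questions.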


EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos

Li, Yuxuan, Veerabadran, Vijay, Iuzzolino, Michael L., Roads, Brett D., Celikyilmaz, Asli, Ridgeway, Karl

arXiv.org Artificial Intelligence

We introduce EgoToM, a new video question-answering benchmark that extends Theory-of-Mind (ToM) evaluation to egocentric domains. Using a causal ToM model, we generate multi-choice video QA instances for the Ego4D dataset to benchmark the ability to predict a camera wearer's goals, beliefs, and next actions. We study the performance of both humans and state-of-the-art multimodal large language models (MLLMs) on these three interconnected inference problems. Our evaluation shows that MLLMs achieve close to human-level accuracy on inferring goals from egocentric videos. However, MLLMs (including the largest ones we tested, with over 100B parameters) fall short of human performance when inferring the camera wearer's in-the-moment belief states and the future actions most consistent with the unseen video future. We believe that our results will shape the future design of an important class of egocentric digital assistants equipped with a reasonable model of the user's internal mental states.


HourVideo: 1-Hour Video-Language Understanding

Chandrasegaran, Keshigeyan, Gupta, Agrim, Hadzic, Lea M., Kota, Taran, He, Jimming, Eyzaguirre, Cristóbal, Durante, Zane, Li, Manling, Wu, Jiajun, Fei-Fei, Li

arXiv.org Artificial Intelligence

We present HourVideo, a benchmark dataset for hour-long video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve marginal improvements over random chance. In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities. Our benchmark, evaluation toolkit, prompts, and documentation are available at https://hourvideo.stanford.edu


Efficient In-Context Learning in Vision-Language Models for Egocentric Videos

Yu, Keunwoo Peter, Zhang, Zheyuan, Hu, Fengyuan, Chai, Joyce

arXiv.org Artificial Intelligence

Recent advancements in text-only large language models (LLMs) have highlighted the benefit of in-context learning for adapting to new tasks with a few demonstrations. However, extending in-context learning to large vision-language models (VLMs) using a huge amount of naturalistic vision-language data has shown limited success, particularly for egocentric videos, due to high data collection costs. We propose a novel training method, Efficient In-context Learning on Egocentric Videos (EILEV), which elicits in-context learning in VLMs for egocentric videos without requiring massive, naturalistic egocentric video datasets. EILEV involves architectural and training data adaptations to allow the model to process contexts interleaved with video clips and narrations, sampling of in-context examples from clusters of similar verbs and nouns, and use of data with skewed marginal distributions featuring a long tail of infrequent verbs and nouns, as well as homonyms and synonyms. Our evaluations show that EILEV-trained models outperform larger VLMs trained on a huge amount of naturalistic data in in-context learning. Furthermore, they can generalize not only to out-of-distribution but also to novel, rare egocentric videos and texts via in-context learning, demonstrating potential for applications requiring cost-effective training and rapid post-deployment adaptability. Our code and demo are available at https://github.com/yukw777/EILEV.
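The sampling step described above, drawing in-context demonstrations from clusters of similar verbs and nouns, can be sketched as follows. This is a rough stand-in, not EILEV's actual code: here clusters are formed by exact verb match, whereas real clusters would come from a taxonomy or embedding similarity, and the field names are invented for illustration:

```python
import random
from collections import defaultdict

def build_clusters(examples):
    """Group narrated clip examples by verb.

    Each example is assumed to be a dict with "verb", "noun", and
    "narration" keys (a hypothetical schema for illustration).
    """
    clusters = defaultdict(list)
    for ex in examples:
        clusters[ex["verb"]].append(ex)
    return clusters

def sample_context(clusters, verb, k, rng=random):
    """Pick up to k demonstrations from the queried verb's cluster,
    without replacement, to interleave with the query clip."""
    pool = clusters.get(verb, [])
    return rng.sample(pool, min(k, len(pool)))
```

Sampling demonstrations that share an action with the query is what lets the model pick up the task format from context rather than from massive naturalistic pretraining data.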


EgoBody: Human Body Shape, Motion and Social Interactions from Head-Mounted Devices

Zhang, Siwei, Ma, Qianli, Zhang, Yan, Qian, Zhiyin, Pollefeys, Marc, Bogo, Federica, Tang, Siyu

arXiv.org Artificial Intelligence

Understanding social interactions from first-person views is crucial for many applications, ranging from assistive robotics to AR/VR. A first step for reasoning about interactions is to understand human pose and shape. However, research in this area is currently hindered by the lack of data. Existing datasets are limited in terms of either size, annotations, ground-truth capture modalities or the diversity of interactions. We address this shortcoming by proposing EgoBody, a novel large-scale dataset for social interactions in complex 3D scenes. We employ Microsoft HoloLens2 headsets to record rich egocentric data streams (including RGB, depth, eye gaze, head and hand tracking). To obtain accurate 3D ground-truth, we calibrate the headset with a multi-Kinect rig and fit expressive SMPL-X body meshes to multi-view RGB-D frames, reconstructing 3D human poses and shapes relative to the scene. We collect 68 sequences, spanning diverse sociological interaction categories, and propose the first benchmark for 3D full-body pose and shape estimation from egocentric views. Our dataset and code will be available for research at https://sanweiliti.github.io/egobody/egobody.html.


Teaching AI to perceive the world through your eyes

#artificialintelligence

AI that understands the world from a first-person point of view could unlock a new era of immersive experiences, as devices like augmented reality (AR) glasses and virtual reality (VR) headsets become as useful in everyday life as smartphones. Imagine your AR device displaying exactly how to hold the sticks during a drum lesson, guiding you through a recipe, helping you find your lost keys, or recalling memories as holograms that come to life in front of you. To build these new technologies, we need to teach AI to understand and interact with the world like we do, from a first-person perspective -- commonly referred to in the research community as egocentric perception. Today's computer vision (CV) systems, however, typically learn from millions of photos and videos that are captured in third-person perspective, where the camera is just a spectator to the action. "Next-generation AI systems will need to learn from an entirely different kind of data -- videos that show the world from the center of the action, rather than the sidelines," says Kristen Grauman, lead research scientist at Facebook.


Facebook: Here comes the AI of the Metaverse

#artificialintelligence

To operate in augmented and virtual reality, Facebook believes artificial intelligence will need to develop an "egocentric perspective." To that end, the company on Thursday announced Ego4D, a data set of 2,792 hours of first-person video, and a set of benchmark tests for neural nets, designed to encourage the development of AI that is savvier about what it's like to move through virtual worlds from a first-person perspective. The project is a collaboration between Facebook Reality Labs and scholars from 13 research institutions, including academic institutions and research labs. The details are laid out in a paper lead-authored by Facebook's Kristen Grauman, "Ego4D: Around the World in 2.8K Hours of Egocentric Video." Grauman is a scientist with the company's Facebook AI Research unit.