first-person video



Comparing Learning Paradigms for Egocentric Video Summarization

Wen, Daniel

arXiv.org Artificial Intelligence

In this study, we investigate various computer vision paradigms - supervised learning, unsupervised learning, and prompt fine-tuning - by assessing their ability to understand and interpret egocentric video data. Specifically, we examine Shotluck Holmes (state-of-the-art supervised learning), TAC-SUM (state-of-the-art unsupervised learning), and GPT-4o (a prompt fine-tuned pre-trained model), evaluating their effectiveness in video summarization. Our results demonstrate that current state-of-the-art models perform less effectively on first-person videos compared to third-person videos, highlighting the need for further advancements in the egocentric video domain. Notably, a prompt fine-tuned general-purpose GPT-4o model outperforms these specialized models, emphasizing the limitations of existing approaches in adapting to the unique challenges of first-person perspectives. Although our evaluation is conducted on a small subset of egocentric videos from the Ego-Exo4D dataset due to resource constraints, the primary objective of this research is to provide a comprehensive proof-of-concept analysis aimed at advancing the application of computer vision techniques to first-person videos. By exploring novel methodologies and evaluating their potential, we aim to contribute to the ongoing development of models capable of effectively processing and interpreting egocentric perspectives.
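To make the comparison concrete, here is a minimal sketch of how one might score a model's summary against a human reference by keyframe overlap (F1), a common video-summarization metric. The function name `summary_f1`, the frame indices, and the per-model predictions are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Hypothetical sketch: scoring a model's video summary against a human
# reference by keyframe overlap (F1). Frame indices and predictions below
# are made up for illustration.

def summary_f1(predicted: set[int], reference: set[int]) -> float:
    """F1 overlap between predicted and reference keyframe indices."""
    if not predicted or not reference:
        return 0.0
    overlap = len(predicted & reference)
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = {10, 42, 97, 150, 203}          # human-selected keyframes
    model_outputs = {
        "Shotluck Holmes": {10, 42, 80, 150},   # illustrative predictions
        "TAC-SUM": {5, 42, 97},
        "GPT-4o (prompted)": {10, 42, 97, 203},
    }
    for name, frames in model_outputs.items():
        print(f"{name}: F1 = {summary_f1(frames, reference):.2f}")
```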


Challenges and Trends in Egocentric Vision: A Survey

Li, Xiang, Qiu, Heqian, Wang, Lanxiao, Zhang, Hanwen, Qi, Chenghao, Han, Linfeng, Xiong, Huiyu, Li, Hongliang

arXiv.org Artificial Intelligence

With the rapid development of artificial intelligence technologies and wearable devices, egocentric vision understanding has emerged as a new and challenging research direction, gradually attracting widespread attention from both academia and industry. Egocentric vision captures visual and multimodal data through cameras or sensors worn on the human body, offering a unique perspective that simulates human visual experiences. This paper provides a comprehensive survey of the research on egocentric vision understanding, systematically analyzing the components of egocentric scenes and categorizing the tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding. We explore in detail the sub-tasks within each category. We also summarize the main challenges and trends currently existing in the field. Furthermore, this paper presents an overview of high-quality egocentric vision datasets, offering valuable resources for future research. By summarizing the latest advancements, we anticipate the broad applications of egocentric vision technologies in fields such as augmented reality, virtual reality, and embodied intelligence, and propose future research directions based on the latest developments in the field.


Terrifyingly, Facebook wants its AI to be your eyes and ears

#artificialintelligence

Facebook has announced a research project that aims to push the "frontier of first-person perception", and in the process help you remember where you left your keys. The Ego4D project provides a huge collection of first-person video and related data, plus a set of challenges for researchers to teach computers to understand the data and gather useful information from it. In September, the social media giant launched a line of "smart glasses" called Ray-Ban Stories, which carry a digital camera and other features. Much like the Google Glass project, which met mixed reviews in 2013, this one has prompted complaints of privacy invasion.


Facebook wants machines to see the world through our eyes

#artificialintelligence

For the last two years, Facebook AI Research (FAIR) has worked with 13 universities around the world to assemble the largest ever data set of first-person video--specifically to train deep-learning image-recognition models. AIs trained on the data set will be better at controlling robots that interact with people, or interpreting images from smart glasses. "Machines will be able to help us in our daily lives only if they really understand the world through our eyes," says Kristen Grauman at FAIR, who leads the project. Such tech could support people who need assistance around the home, or guide people in tasks they are learning to complete. "The video in this data set is much closer to how humans observe the world," says Michael Ryoo, a computer vision researcher at Google Brain and Stony Brook University in New York, who is not involved in Ego4D.


CMU Helps Compile Largest Collection of First-Person Videos

CMU School of Computer Science

Researchers at Carnegie Mellon University helped compile and will have access to the largest collection of point-of-view videos in the world. These videos could enable artificial intelligence to understand the world from a first-person point of view and unlock a new wave of virtual assistants, augmented reality and robotics. Until now, most of the video used to train computer vision models came from the third-person point of view. The first-person, or egocentric, video included in this collection will allow researchers to train computer vision systems to see the world as humans do. "For the first time, we'll have enough data to be able to teach computers to see what we see," said Kris Kitani, an associate research professor in the Robotics Institute who led CMU's efforts to collect data.


Learning Robot Activities from First-Person Human Videos Using Convolutional Future Regression

Lee, Jangwon, Ryoo, Michael S.

arXiv.org Artificial Intelligence

We design a new approach that enables a robot to learn new activities from unlabeled human example videos. Given videos of humans executing the same activity from a human's viewpoint (i.e., first-person videos), our objective is to make the robot learn the temporal structure of the activity as its future regression network and to transfer that model to its own motor execution. We present a new deep learning model: we extend a state-of-the-art convolutional object detection network to represent and estimate human hands in the training videos, and introduce a fully convolutional network that regresses (i.e., predicts) the intermediate scene representation corresponding to a future frame (e.g., 1-2 seconds later). Combining these allows direct prediction of the future locations of human hands and objects, which lets the robot infer a motor control plan using our manipulation network. We experimentally confirm that our approach makes it possible to learn robot activities from unlabeled human interaction videos, and demonstrate that our robot can execute the learned collaborative activities in real time directly from its camera input.
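A minimal sketch of the future-regression idea is given below, assuming a PyTorch implementation: a fully convolutional network maps the intermediate scene representation of the current frame to a predicted representation of a future frame (roughly 1-2 seconds ahead) and is trained with a simple regression loss. The class name `FutureRegressionFCN`, the channel count, and the layer configuration are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal PyTorch sketch of future regression: a fully convolutional network
# maps the current frame's scene features to the predicted features of a
# future frame (e.g., 1-2 seconds later). Layer sizes and the training loss
# are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class FutureRegressionFCN(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),  # predicted future feature map
        )

    def forward(self, current_features: torch.Tensor) -> torch.Tensor:
        # Input/output shape: (batch, channels, H, W) intermediate scene
        # features, e.g., from a convolutional hand/object detection backbone.
        return self.net(current_features)


if __name__ == "__main__":
    model = FutureRegressionFCN()
    now = torch.randn(2, 256, 32, 32)      # features of the current frame
    future = torch.randn(2, 256, 32, 32)   # stand-in target: features ~1-2 s ahead
    loss = nn.functional.mse_loss(model(now), future)
    loss.backward()
    print(f"regression loss: {loss.item():.4f}")
```

In the paper's setting, the predicted future representation would then feed a manipulation network that infers motor commands; here a random tensor stands in for the future-frame features purely to show the regression step.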