first-person video



Comparing Learning Paradigms for Egocentric Video Summarization

Wen, Daniel

arXiv.org Artificial Intelligence

In this study, we investigate various computer vision paradigms - supervised learning, unsupervised learning, and prompt fine-tuning - by assessing their ability to understand and interpret egocentric video data. Specifically, we examine Shotluck Holmes (state-of-the-art supervised learning), TAC-SUM (state-of-the-art unsupervised learning), and GPT-4o (a prompt fine-tuned pre-trained model), evaluating their effectiveness in video summarization. Our results demonstrate that current state-of-the-art models perform less effectively on first-person videos compared to third-person videos, highlighting the need for further advancements in the egocentric video domain. Notably, a prompt fine-tuned general-purpose GPT-4o model outperforms these specialized models, emphasizing the limitations of existing approaches in adapting to the unique challenges of first-person perspectives. Although our evaluation is conducted on a small subset of egocentric videos from the Ego-Exo4D dataset due to resource constraints, the primary objective of this research is to provide a comprehensive proof-of-concept analysis aimed at advancing the application of computer vision techniques to first-person videos. By exploring novel methodologies and evaluating their potential, we aim to contribute to the ongoing development of models capable of effectively processing and interpreting egocentric perspectives.
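To make the comparison concrete, here is a minimal sketch of how one might score a model's summary against a human reference by keyframe overlap (F1), a common video-summarization metric. The function name `summary_f1`, the frame indices, and the per-model predictions are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Hypothetical sketch: scoring a model's video summary against a human
# reference by keyframe overlap (F1). Frame indices and predictions below
# are made up for illustration.

def summary_f1(predicted: set[int], reference: set[int]) -> float:
    """F1 overlap between predicted and reference keyframe indices."""
    if not predicted or not reference:
        return 0.0
    overlap = len(predicted & reference)
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = {10, 42, 97, 150, 203}          # human-selected keyframes
    model_outputs = {
        "Shotluck Holmes": {10, 42, 80, 150},   # illustrative predictions
        "TAC-SUM": {5, 42, 97},
        "GPT-4o (prompted)": {10, 42, 97, 203},
    }
    for name, frames in model_outputs.items():
        print(f"{name}: F1 = {summary_f1(frames, reference):.2f}")
```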


Challenges and Trends in Egocentric Vision: A Survey

Li, Xiang, Qiu, Heqian, Wang, Lanxiao, Zhang, Hanwen, Qi, Chenghao, Han, Linfeng, Xiong, Huiyu, Li, Hongliang

arXiv.org Artificial Intelligence

With the rapid development of artificial intelligence technologies and wearable devices, egocentric vision understanding has emerged as a new and challenging research direction, gradually attracting widespread attention from both academia and industry. Egocentric vision captures visual and multimodal data through cameras or sensors worn on the human body, offering a unique perspective that simulates human visual experiences. This paper provides a comprehensive survey of the research on egocentric vision understanding, systematically analyzing the components of egocentric scenes and categorizing the tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding. We explore in detail the sub-tasks within each category. We also summarize the main challenges and trends currently existing in the field. Furthermore, this paper presents an overview of high-quality egocentric vision datasets, offering valuable resources for future research. By summarizing the latest advancements, we anticipate the broad applications of egocentric vision technologies in fields such as augmented reality, virtual reality, and embodied intelligence, and propose future research directions based on the latest developments in the field.


Terrifyingly, Facebook wants its AI to be your eyes and ears

#artificialintelligence

Facebook has announced a research project that aims to push the "frontier of first-person perception", and in the process help you remember where you left your keys. The Ego4D project provides a huge collection of first-person video and related data, plus a set of challenges for researchers to teach computers to understand the data and gather useful information from it. In September, the social media giant launched a line of "smart glasses" called Ray-Ban Stories, which carry a digital camera and other features. Much like the Google Glass project, which met mixed reviews in 2013, this one has prompted complaints of privacy invasion.


Facebook wants machines to see the world through our eyes

#artificialintelligence

For the last two years, Facebook AI Research (FAIR) has worked with 13 universities around the world to assemble the largest ever data set of first-person video--specifically to train deep-learning image-recognition models. AIs trained on the data set will be better at controlling robots that interact with people, or interpreting images from smart glasses. "Machines will be able to help us in our daily lives only if they really understand the world through our eyes," says Kristen Grauman at FAIR, who leads the project. Such tech could support people who need assistance around the home, or guide people in tasks they are learning to complete. "The video in this data set is much closer to how humans observe the world," says Michael Ryoo, a computer vision researcher at Google Brain and Stony Brook University in New York, who is not involved in Ego4D.


CMU Helps Compile Largest Collection of First-Person Videos

CMU School of Computer Science

Researchers at Carnegie Mellon University helped compile and will have access to the largest collection of point-of-view videos in the world. These videos could enable artificial intelligence to understand the world from a first-person point of view and unlock a new wave of virtual assistants, augmented reality and robotics. Until now, most of the video used to train computer vision models came from the third-person point of view. The first-person, or egocentric, video included in this collection will allow researchers to train computer vision systems to see the world as humans do. "For the first time, we'll have enough data to be able to teach computers to see what we see," said Kris Kitani, an associate research professor in the Robotics Institute who led CMU's efforts to collect data.


Learning Robot Activities from First-Person Human Videos Using Convolutional Future Regression

Lee, Jangwon, Ryoo, Michael S.

arXiv.org Artificial Intelligence

We design a new approach that enables a robot to learn new activities from unlabeled human example videos. Given videos of humans executing the same activity from a human's viewpoint (i.e., first-person videos), our objective is to make the robot learn the temporal structure of the activity as its future regression network and to transfer that model to its own motor execution. We present a new deep learning model: we extend a state-of-the-art convolutional object detection network to represent and estimate human hands in the training videos, and introduce a fully convolutional network that regresses (i.e., predicts) the intermediate scene representation corresponding to a future frame (e.g., 1-2 seconds later). Combining these allows direct prediction of the future locations of human hands and objects, which lets the robot infer a motor control plan using our manipulation network. We experimentally confirm that our approach makes it possible to learn robot activities from unlabeled human interaction videos, and demonstrate that our robot can execute the learned collaborative activities in real time directly from its camera input.
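A minimal sketch of the future-regression idea is given below, assuming a PyTorch implementation: a fully convolutional network maps the intermediate scene representation of the current frame to a predicted representation of a future frame (roughly 1-2 seconds ahead) and is trained with a simple regression loss. The class name `FutureRegressionFCN`, the channel count, and the layer configuration are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal PyTorch sketch of future regression: a fully convolutional network
# maps the current frame's scene features to the predicted features of a
# future frame (e.g., 1-2 seconds later). Layer sizes and the training loss
# are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class FutureRegressionFCN(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),  # predicted future feature map
        )

    def forward(self, current_features: torch.Tensor) -> torch.Tensor:
        # Input/output shape: (batch, channels, H, W) intermediate scene
        # features, e.g., from a convolutional hand/object detection backbone.
        return self.net(current_features)


if __name__ == "__main__":
    model = FutureRegressionFCN()
    now = torch.randn(2, 256, 32, 32)      # features of the current frame
    future = torch.randn(2, 256, 32, 32)   # stand-in target: features ~1-2 s ahead
    loss = nn.functional.mse_loss(model(now), future)
    loss.backward()
    print(f"regression loss: {loss.item():.4f}")
```

In the paper's setting, the predicted future representation would then feed a manipulation network that infers motor commands; here a random tensor stands in for the future-frame features purely to show the regression step.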