narration
- North America > United States > California (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > Bangladesh (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Leisure & Entertainment (0.67)
- Law (0.67)
- Information Technology > Security & Privacy (0.45)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
- North America > Canada > British Columbia (0.04)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
31fb284a0aaaad837d2930a610cd5e50-Supplemental-Conference.pdf
In our work, we study video-language pretraining in a specific yet significant domain, the first-person view, motivated by the release of the Ego4D dataset. The varying clip frequencies depend mainly on the manual narrations, which are annotated according to the video scenarios and activities. There are on average 13.4 clips per minute of video, with a maximum of 175.8. Fig. 6(b) displays the distribution of clip duration. In Figure 1(c), we present the distribution of narration length in words.
- South America > Colombia (0.05)
- North America > United States > Minnesota (0.05)
- North America > United States > Indiana (0.05)
- (5 more...)
- Asia > Singapore (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
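The clip-frequency and narration-length statistics described above can be reproduced with a few lines of bookkeeping over the narration annotations. The sketch below is a minimal illustration assuming a hypothetical JSON layout; field names such as `duration_sec` and `narration_text` are assumptions, not the actual Ego4D schema.

```python
# Hypothetical sketch: clips-per-minute and narration word-count statistics of
# the kind reported above. The annotation layout is assumed, not Ego4D's schema.
import json

import numpy as np


def narration_stats(annotation_path: str) -> dict:
    """Summarise per-video narration density and narration word counts."""
    with open(annotation_path) as f:
        videos = json.load(f)  # assumed: {video_uid: {"duration_sec": ..., "narrations": [...]}}

    clips_per_minute = []
    word_counts = []
    for video in videos.values():
        duration_min = video["duration_sec"] / 60.0
        narrations = video["narrations"]
        if duration_min > 0:
            clips_per_minute.append(len(narrations) / duration_min)
        word_counts.extend(len(n["narration_text"].split()) for n in narrations)

    return {
        "avg_clips_per_min": float(np.mean(clips_per_minute)),
        "max_clips_per_min": float(np.max(clips_per_minute)),
        "avg_words_per_narration": float(np.mean(word_counts)),
    }
```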
COBE: Contextualized Object Embeddings from Narrated Instructional Video
Many objects in the real world undergo dramatic variations in visual appearance. For example, a tomato may be red or green, sliced or chopped, fresh or fried, liquid or solid. Training a single detector to accurately recognize tomatoes in all these different states is challenging. On the other hand, contextual cues (e.g., the presence of a knife, a cutting board, a strainer or a pan) are often strongly indicative of how the object appears in the scene. Recognizing such contextual cues is useful not only to improve the accuracy of object detection or to determine the state of the object, but also to understand its functional properties and to infer ongoing or upcoming human-object interactions.
- Education > Educational Technology > Media (0.45)
- Education > Educational Technology > Audio & Video (0.45)
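The point that contextual cues change how an object should be represented can be illustrated with an off-the-shelf text encoder: the same noun receives different contextualized embeddings depending on the surrounding narration. The sketch below uses a generic BERT model from Hugging Face Transformers purely as an illustration; it is not the COBE embedding itself, and it assumes the target noun maps to a single WordPiece token.

```python
# Illustrative sketch (not the COBE model): contextual cues in the narration
# shift the representation of the same object noun.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()


def noun_embedding(sentence: str, noun: str) -> torch.Tensor:
    """Contextualized embedding of `noun` inside `sentence`.
    Assumes `noun` is a single WordPiece token (true for common words like 'tomato')."""
    inputs = tokenizer(sentence, return_tensors="pt")
    noun_id = tokenizer.convert_tokens_to_ids(noun)
    position = (inputs["input_ids"][0] == noun_id).nonzero()[0, 0]
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    return hidden[position]


fresh = noun_embedding("place the whole tomato on the cutting board", "tomato")
fried = noun_embedding("stir the tomato sauce simmering in the pan", "tomato")
print(torch.cosine_similarity(fresh, fried, dim=0))  # < 1.0: context shifts the embedding
```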
Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos
We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training. We introduce a divided strategy that alternates between computing inter- and intra-modal attention across the visual and natural language modalities, which allows effective training via directly contrasting the two modalities' representations. We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized described interactions in the YouCook2 dataset. We show that our approach outperforms alternative baselines, including shallow co-attention and full cross-modal attention. We also apply our approach to grounding phrases in images with weak supervision on Flickr30K and show that stacking multiple attention layers is effective and, when combined with a word-to-region loss, achieves state of the art on recall-at-one and pointing-hand accuracies.
- Education > Educational Technology > Media (0.66)
- Education > Educational Technology > Audio & Video (0.66)
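The alternating inter-/intra-modal attention idea with a symmetric contrastive loss over matched clip-narration pairs can be sketched as below. Dimensions, layer counts, and module names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: alternate intra-modal and inter-modal attention over visual
# region features and narration token features, then contrast pooled embeddings
# of matched (clip, narration) pairs against in-batch negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DividedCrossModalAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        make = lambda: nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)]
        )
        self.intra_v, self.intra_t = make(), make()
        self.inter_v, self.inter_t = make(), make()

    def forward(self, vis, txt):
        # vis: (B, Nv, D) region features; txt: (B, Nt, D) narration token features
        for iv, it, cv, ct in zip(self.intra_v, self.intra_t, self.inter_v, self.inter_t):
            vis = vis + iv(vis, vis, vis)[0]      # intra-modal attention (visual)
            txt = txt + it(txt, txt, txt)[0]      # intra-modal attention (text)
            vis = vis + cv(vis, txt, txt)[0]      # inter-modal attention (text -> visual)
            txt = txt + ct(txt, vis, vis)[0]      # inter-modal attention (visual -> text)
        return vis.mean(dim=1), txt.mean(dim=1)   # pooled clip / narration embeddings


def contrastive_loss(v, t, temperature: float = 0.07):
    """Symmetric InfoNCE over matched (clip, narration) pairs in a batch."""
    v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
    logits = v @ t.T / temperature
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```

Training directly contrasts the two modalities' pooled representations, so no box- or region-level supervision is needed; spatial grounding falls out of the learned attention maps.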
Amazon pulls AI recap from Fallout TV show after it made several mistakes
Amazon has pulled a video recap made with artificial intelligence (AI) from its hit TV show Fallout after users said it got several facts wrong about the series. The firm said in November it was testing the first-of-its-kind tool in the US to help viewers catch up on some of its shows on streaming service Prime Video - including Fallout, its adaptation of the popular video game franchise. But it has since disappeared from the site after users highlighted mistakes in its video summarising the events of Fallout season one - including claiming one scene was set more than 100 years earlier than it was. The BBC has approached Amazon for comment. The move to apparently press pause on its AI-powered recaps was first reported by tech publication The Verge.
- North America > Central America (0.15)
- Oceania > Australia (0.06)
- Europe > United Kingdom > Wales (0.06)
- (15 more...)
- Leisure & Entertainment (1.00)
- Media > Television (0.92)
Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations
Nicola Messina, Rosario Leonardi, Luca Ciampi, Fabio Carrara, Giovanni Maria Farinella, Fabrizio Falchi, Antonino Furnari
Pixel-level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels. In this paper, we propose to learn human-object interaction detection leveraging narrations – natural language descriptions of the actions performed by the camera wearer which contain clues about manipulated objects. We introduce Narration-Supervised in-Hand Object Segmentation (NS-iHOS), a novel task where models have to learn to segment in-hand objects by learning from natural-language narrations in a weakly-supervised regime. Narrations are then not employed at inference time. We showcase the potential of the task by proposing Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), an end-to-end model distilling knowledge from narrations to learn plausible hand-object associations and enable in-hand object segmentation without using narrations at test time. We benchmark WISH against different baselines based on open-vocabulary object detectors and vision-language models. Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods, without employing fine-grained pixel-wise annotations. Code and data can be found at https://fpv-iplab.github.io/WISH.
- North America > United States (0.05)
- Europe > Italy > Tuscany > Pisa Province > Pisa (0.04)
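A minimal sketch of the narration-weak supervision idea is given below, assuming precomputed candidate region features (e.g. hand-proximal mask proposals) and an embedding of the narrated object noun. The module names and the MIL-style contrastive objective are illustrative assumptions, not the WISH architecture; the only properties it shares with the task description are that no pixel-level masks are used and narrations are needed only at training time.

```python
# Hypothetical sketch: score candidate regions against the narrated object noun
# and train with a multiple-instance, in-batch contrastive objective, so that
# at least one region per clip matches its own narration. No pixel labels used;
# narrations are dropped at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionNarrationScorer(nn.Module):
    def __init__(self, region_dim: int = 1024, text_dim: int = 512, dim: int = 256):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, dim)  # projects mask-proposal features
        self.text_proj = nn.Linear(text_dim, dim)      # projects noun-phrase embeddings

    def forward(self, region_feats, noun_emb):
        # region_feats: (B, R, region_dim) candidate regions near the hands
        # noun_emb:     (B, text_dim) embedding of the narrated object noun
        r = F.normalize(self.region_proj(region_feats), dim=-1)
        t = F.normalize(self.text_proj(noun_emb), dim=-1)
        return (r * t.unsqueeze(1)).sum(-1)            # (B, R) region-narration scores


def weak_mil_contrastive_loss(region_feats, noun_embs, scorer, temperature: float = 0.1):
    """Each clip's regions are scored against every narration in the batch; after
    max-pooling over regions, the matched narration should score highest."""
    B = region_feats.size(0)
    pooled = torch.stack(
        [scorer(region_feats, noun_embs[j].expand(B, -1)).max(dim=1).values for j in range(B)],
        dim=1,
    )  # (B clips, B narrations)
    labels = torch.arange(B, device=pooled.device)
    return F.cross_entropy(pooled / temperature, labels)
```

At test time only the region scorer is kept: the highest-scoring candidate region is taken as the in-hand object, with no narration required.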