narration
31fb284a0aaaad837d2930a610cd5e50-Supplemental-Conference.pdf
In our work, we study video-language pretraining in a specific yet significant domain - the 1st-person view, which is motivated by the release of the Ego4D dataset. The varying clip frequencies are mainly dependent on manual narrations that are annotated based on the video scenarios and activities. There are on average 13.4 clips per minute of video, with a maximum of 175.8. Fig. 6(b) displays the distribution of clip duration. In Figure 1 (c), we present the distribution of narration word length.
COBE: Contextualized Object Embeddings from Narrated Instructional Video
Many objects in the real world undergo dramatic variations in visual appearance. For example, a tomato may be red or green, sliced or chopped, fresh or fried, liquid or solid. Training a single detector to accurately recognize tomatoes in all these different states is challenging. On the other hand, contextual cues (e.g., the presence of a knife, a cutting board, a strainer or a pan) are often strongly indicative of how the object appears in the scene. Recognizing such contextual cues is useful not only to improve the accuracy of object detection or to determine the state of the object, but also to understand its functional properties and to infer ongoing or upcoming human-object interactions.
Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos
We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training. We introduce a divided strategy that alternates between computing inter- and intra-modal attention across the visual and natural language modalities, which allows effective training via directly contrasting the two modalities' representations. We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized described interactions in the YouCook2 dataset. We show that our approach outperforms alternative baselines, including shallow co-attention and full cross-modal attention. We also apply our approach to grounding phrases in images with weak supervision on Flickr30K and show that stacking multiple attention layers is effective and, when combined with a word-to-region loss, achieves state of the art on recall-at-one and pointing hand accuracies.
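To make the idea of "directly contrasting the two modalities' representations" concrete, here is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over paired video and narration embeddings. It is not the paper's attention network or its exact objective: the function names, the cosine similarity, and the temperature value of 0.1 are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def log_softmax_diag(sim):
    """For each row i, return log softmax(sim[i])[i] (the matched pair's log-probability)."""
    out = []
    for i, row in enumerate(sim):
        m = max(row)  # subtract the row max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        out.append(row[i] - log_z)
    return out


def contrastive_loss(video_embs, text_embs, temperature=0.1):
    """Hypothetical symmetric contrastive loss; pair i of each list is assumed to match."""
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    # sim[i][j]: scaled cosine similarity between video i and narration j
    sim = [[dot(vi, tj) / temperature for tj in t] for vi in v]
    sim_t = [list(col) for col in zip(*sim)]  # narration-to-video direction
    v2t = log_softmax_diag(sim)
    t2v = log_softmax_diag(sim_t)
    return -(sum(v2t) + sum(t2v)) / (2 * len(v))


# Toy embeddings (made up): aligned pairs versus deliberately mismatched pairs.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < shuffled)
```

The loss is small when each video embedding is closest to its own narration and large when the pairing is wrong, which is the signal such a model would be trained on.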
Amazon pulls AI recap from Fallout TV show after it made several mistakes
Amazon has pulled a video recap made with artificial intelligence (AI) from its hit TV show Fallout after users said it got several facts wrong about the series. The firm said in November it was testing the first-of-its-kind tool in the US to help viewers catch up on some of its shows on streaming service Prime Video - including Fallout, its adaptation of the popular video game franchise. But it has since disappeared from the site after users highlighted mistakes in its video summarising the events of Fallout season one - including claiming one scene was set more than 100 years earlier than it was. The BBC has approached Amazon for comment. The move to apparently press pause on its AI-powered recaps was first reported by tech publication The Verge.