

Supplementary Material A: TF-CoVR Statistics and Modification Lexicon

Neural Information Processing Systems

TF-CoVR Statistics. We present detailed statistics on the distribution of video counts per label in TF-CoVR, which comprises a diverse set of 306 annotated sub-actions. Both distributions, for the FineGym [3] and FineDiving [6] subsets of TF-CoVR, are plotted on a logarithmic scale to emphasize the long-tailed nature of label frequencies. In FineGym, many labels have several hundred to over a thousand associated videos, with a gradual decline across the distribution. By contrast, FineDiving exhibits a steeper drop in video count per label, primarily due to its smaller dataset size. Nevertheless, a substantial number of labels still contain more [...] samples, preserving enough diversity to support temporal fine-grained composed video retrieval. A logarithmic scale is used on the y-axis to highlight the steep drop in video counts per label due to the smaller dataset size.


4d5f03fdb238255019826032ae7cc8e2-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems

Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias, where answers can be inferred from visual data alone, and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE (Diagnostic Audio Visual Evaluation), a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled settings. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models.


REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

Neural Information Processing Systems

Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot 'quote' from the input videos, i.e., insert short video clips in their outputs. In this work, we explore new video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation (REG) framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a multimodal retrieval model to replace the quote placeholders by selecting a video clip that best supports the narrative from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in documentary teaser generation.


IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants

Neural Information Processing Systems

We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models.







31fb284a0aaaad837d2930a610cd5e50-Supplemental-Conference.pdf

Neural Information Processing Systems

In our work, we study video-language pretraining in a specific yet significant domain - the 1st-person view, which is motivated by the release of the Ego4D dataset. The varying clip frequencies are mainly dependent on manual narrations that are annotated based on the video scenarios and activities. There are on average 13.4 clips per minute of video, with a maximum of 175.8. Fig. 6(b) displays the distribution of clip duration. In Figure 1 (c), we present the distribution of narration word length.