AITopics | narration

90ce332aff156b910b002ce4e6880dec-Paper-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems | Apr-29-2026, 00:35:40 GMT

data mining, large language model, machine learning, (17 more...)

Neural Information Processing Systems

Genre: Research Report (0.67)

Industry:

Leisure & Entertainment (0.67)
Law (0.67)
Information Technology > Security & Privacy (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

b689b90ddf2b47d3103decabe6d47446-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Mar-14-2026, 03:30:39 GMT

Robotics, autonomous driving, augmented reality, and many embodied computer vision applications must quickly react to user-defined events unfolding in real time.

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.04)
Europe > Italy > Tuscany > Florence (0.04)

Industry: Law (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)

A Diagnostic Benchmark for Very Long-form Video Language Understanding

Neural Information Processing Systems | Feb-15-2026, 21:24:53 GMT

To remedy this, we introduce temporal certificate sets, a general notion for capturing the intrinsic temporal understanding length associated with a broad range of video understanding tasks & datasets.

large language model, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > Bangladesh (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.67)

Industry:

Leisure & Entertainment (0.67)
Law (0.67)
Information Technology > Security & Privacy (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

acaa23f71f963e96c8847585e71352d6-Paper.pdf

Neural Information Processing Systems | Feb-9-2026, 19:30:53 GMT

computer vision, dataset, noun, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

792dd774336314c3c27a04bb260cf2cf-Paper.pdf

Neural Information Processing Systems | Feb-9-2026, 11:13:42 GMT

interaction, modality, representation, (13 more...)

Neural Information Processing Systems

Country:

North America > Canada > British Columbia (0.04)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

31fb284a0aaaad837d2930a610cd5e50-Supplemental-Conference.pdf

Neural Information Processing Systems | Feb-8-2026, 05:24:11 GMT

In our work, we study video-language pretraining in a specific yet significant domain, the 1st-person view, which is motivated by the release of the Ego4D dataset. The varying clip frequencies are mainly dependent on manual narrations that are annotated based on the video scenarios and activities. There are on average 13.4 clips per minute of video, with a maximum of 175.8. Fig. 6(b) displays the distribution of clip duration. In Figure 1 (c), we present the distribution of narration word length.

artificial intelligence, egoclip, video, (17 more...)

Neural Information Processing Systems

Country:

South America > Colombia (0.05)
North America > United States > Minnesota (0.05)
North America > United States > Indiana (0.05)
(5 more...)

Technology: Information Technology > Artificial Intelligence (0.70)

Egocentric Video-Language Pretraining

Neural Information Processing Systems | Feb-8-2026, 05:24:07 GMT

As illustrated in Tab. 1, the formerly largest egocentric video dataset, EPIC-KITCHENS-100 [14], focuses on kitchen scenarios, and its size is far smaller than those of the 3rd-person pretraining sets WebVid-2M [3] and HowTo100M [10].

artificial intelligence, egoclip, video, (15 more...)

Neural Information Processing Systems

Country:

Asia > Singapore (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)

Technology: Information Technology > Artificial Intelligence > Vision (0.36)

COBE: Contextualized Object Embeddings from Narrated Instructional Video

Neural Information Processing Systems | Dec-24-2025, 10:39:21 GMT

Many objects in the real world undergo dramatic variations in visual appearance. For example, a tomato may be red or green, sliced or chopped, fresh or fried, liquid or solid. Training a single detector to accurately recognize tomatoes in all these different states is challenging. On the other hand, contextual cues (e.g., the presence of a knife, a cutting board, a strainer or a pan) are often strongly indicative of how the object appears in the scene. Recognizing such contextual cues is useful not only to improve the accuracy of object detection or to determine the state of the object, but also to understand its functional properties and to infer ongoing or upcoming human-object interactions.

contextualized object embedding, name change, narrated instructional video, (7 more...)

Neural Information Processing Systems

Industry:

Education > Educational Technology > Media (0.45)
Education > Educational Technology > Audio & Video (0.45)

Technology: Information Technology > Artificial Intelligence (0.76)

Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

Neural Information Processing Systems | Dec-24-2025, 08:21:46 GMT

We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training. We introduce a divided strategy that alternates between computing inter- and intra-modal attention across the visual and natural language modalities, which allows effective training via directly contrasting the two modalities' representations. We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized described interactions in the YouCook2 dataset. We show that our approach outperforms alternative baselines, including shallow co-attention and full cross-modal attention. We also apply our approach to grounding phrases in images with weak supervision on Flickr30K and show that stacking multiple attention layers is effective and, when combined with a word-to-region loss, achieves state of the art on recall-at-one and pointing hand accuracies.

instructional video, name change, self-supervised spatial grounding, (7 more...)

Neural Information Processing Systems

Genre: Instructional Material > Course Syllabus & Notes (0.30)

Industry:

Education > Educational Technology > Media (0.66)
Education > Educational Technology > Audio & Video (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Amazon pulls AI recap from Fallout TV show after it made several mistakes

BBC News | Dec-12-2025, 18:04:47 GMT

Amazon has pulled a video recap made with artificial intelligence (AI) from its hit TV show Fallout after users said it got several facts wrong about the series. The firm said in November it was testing the first-of-its-kind tool in the US to help viewers catch up on some of its shows on streaming service Prime Video - including Fallout, its adaptation of the popular video game franchise. But it has since disappeared from the site after users highlighted mistakes in its video summarising the events of Fallout season one - including claiming one scene was set more than 100 years earlier than it was. The BBC has approached Amazon for comment. The move to apparently press pause on its AI-powered recaps was first reported by tech publication The Verge.

artificial intelligence, machine learning, natural language, (11 more...)

BBC News

Country: