moment-detr
Detecting Moments and Highlights in Videos via Natural Language Queries
Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHighlights) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods.
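To make the "direct set prediction" formulation concrete, here is a minimal PyTorch sketch of a DETR-style moment retrieval model: learned moment queries decode against jointly encoded video and text features into span coordinates, foreground logits, and per-clip saliency scores. This is an illustration of the idea only, not the official Moment-DETR code; the class name, layer counts, and feature dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MomentDETRSketch(nn.Module):
    """Illustrative DETR-style moment retrieval model (not the official implementation)."""
    def __init__(self, d_model=256, num_queries=10, vid_dim=2048, txt_dim=512):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, d_model)   # project pre-extracted clip features
        self.txt_proj = nn.Linear(txt_dim, d_model)   # project query token features
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.moment_queries = nn.Embedding(num_queries, d_model)  # learned moment slots
        self.span_head = nn.Linear(d_model, 2)      # (center, width), normalized to [0, 1]
        self.class_head = nn.Linear(d_model, 2)     # foreground / background
        self.saliency_head = nn.Linear(d_model, 1)  # per-clip saliency score

    def forward(self, vid_feats, txt_feats):
        # vid_feats: (B, Lv, vid_dim); txt_feats: (B, Lt, txt_dim)
        src = torch.cat([self.vid_proj(vid_feats), self.txt_proj(txt_feats)], dim=1)
        tgt = self.moment_queries.weight.unsqueeze(0).expand(src.size(0), -1, -1)
        memory = self.transformer.encoder(src)       # joint video-text encoding
        hs = self.transformer.decoder(tgt, memory)   # one hidden state per moment query
        spans = self.span_head(hs).sigmoid()         # (B, num_queries, 2)
        logits = self.class_head(hs)                 # (B, num_queries, 2)
        # Saliency is read off the encoded video positions only.
        saliency = self.saliency_head(memory[:, :vid_feats.size(1)]).squeeze(-1)
        return spans, logits, saliency
```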
Appendix for QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
In Table 2, we show the effect of using different numbers of moment queries. As can be seen from the table, this hyper-parameter has a large impact on the moment retrieval task, where a reasonably small value (e.g., 10) gives better performance. As described in Equation 3 of the main text, Moment-DETR's saliency loss consists of two terms; in Table 3, we study the effect of each term. We show more correct predictions and failure cases from our Moment-DETR model in Figure 1 and Figure 2. In Table 4, we show the distribution of annotated saliency scores: 94.41% of the annotated clips are rated by two or more users as 'Fair' or better (i.e., >=3). To ensure data quality, we require workers to pass our qualification test before participating in our annotation task.
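As a rough illustration of a two-term saliency loss of the kind discussed above, the sketch below pairs two hinge terms: one contrasting a higher-rated against a lower-rated clip inside a ground-truth moment, and one contrasting an in-moment against an out-of-moment clip. The function name, index arguments, and margin are hypothetical, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def saliency_hinge_loss(scores, pos_in, neg_in, neg_out, margin=0.2):
    """Sketch of a two-term hinge saliency loss (names and margin are illustrative).

    scores:  (B, Lv) predicted per-clip saliency scores.
    pos_in:  (B,) index of a higher-rated clip inside the ground-truth moment.
    neg_in:  (B,) index of a lower-rated clip inside the same moment.
    neg_out: (B,) index of a clip outside any relevant moment.
    """
    b = torch.arange(scores.size(0), device=scores.device)
    # Term 1: within the moment, a higher-rated clip should outscore a lower-rated one.
    intra = F.relu(margin + scores[b, neg_in] - scores[b, pos_in])
    # Term 2: an in-moment clip should outscore an out-of-moment clip.
    inter = F.relu(margin + scores[b, neg_out] - scores[b, pos_in])
    return (intra + inter).mean()
```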
Localizing Moments in Long Video Via Multimodal Guidance
Barrios, Wayner, Soldan, Mattia, Ceballos-Arroyo, Alberto Mario, Heilbron, Fabian Caba, Ghanem, Bernard
The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with an interesting finding: current grounding methods alone fail at this challenging task and setup due to their inability to process long video sequences. In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model. The Guidance Model emphasizes describable windows, while the base grounding model analyzes short temporal windows to determine which segments accurately match a given language query. We offer two designs for the Guidance Model: Query-Agnostic and Query-Dependent, which balance efficiency and accuracy. Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% on MAD and 4.52% on Ego4D (NLQ). Code, data and MAD's audio features necessary to reproduce our experiments are available at: https://github.com/waybarrios/guidance-based-video-grounding.
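A minimal sketch of the pruning idea: score candidate windows with a guidance model and keep only the top fraction before running the more expensive base grounding model. Here `guidance_model`, `keep_ratio`, and the function name are hypothetical placeholders, not the authors' API.

```python
import torch

def prune_windows(window_feats, guidance_model, keep_ratio=0.5):
    """Sketch of guidance-based window pruning (illustrative, not the authors' code).

    window_feats:   (num_windows, d) one feature vector per temporal window.
    guidance_model: hypothetical scorer returning one "describability" score per window.
    """
    scores = guidance_model(window_feats)           # (num_windows,)
    k = max(1, int(keep_ratio * scores.numel()))
    keep = scores.topk(k).indices.sort().values     # keep top-k, preserve temporal order
    # Only the surviving windows are handed to the base grounding model.
    return window_feats[keep], keep
```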
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
Hou, Zhijian, Zhong, Wanjun, Ji, Lei, Gao, Difei, Yan, Kun, Chan, Wing-Kwong, Ngo, Chong-Wah, Shou, Zheng, Duan, Nan
This paper tackles an emerging and challenging problem of long video temporal grounding (VTG) that localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are equally in demand but far less explored, which brings new challenges: higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models to handle long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency, as the query-guided window selection mechanism accelerates inference by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results. Code has been released at https://github.com/houzhijian/CONE.
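The coarse-to-fine pipeline can be sketched as follows: rank sliding windows by query similarity (coarse), then run the base grounding model only inside the top-ranked windows and map its proposals back to video time (fine). This is an assumption-laden illustration, not the released CONE code: `base_grounder` and all hyper-parameters are placeholders, and the real method uses learned contrastive alignment rather than raw cosine similarity of mean-pooled features.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_ground(clip_feats, query_feat, base_grounder,
                          window_size=64, stride=32, top_k=3):
    """Sketch of a coarse-to-fine sliding-window pipeline (illustrative only).

    clip_feats: (Lv, d) per-clip features; assumes Lv >= window_size.
    query_feat: (d,) pooled query feature.
    base_grounder: hypothetical model returning (start, end, score) tuples per window.
    """
    windows = [(s, clip_feats[s:s + window_size])
               for s in range(0, clip_feats.size(0) - window_size + 1, stride)]
    # Coarse stage: query-guided selection via cosine similarity of pooled windows.
    sims = torch.stack([
        F.cosine_similarity(w.mean(dim=0), query_feat, dim=0) for _, w in windows
    ])
    keep = sims.topk(min(top_k, len(windows))).indices
    # Fine stage: ground only inside selected windows; offsets map back to video time.
    proposals = []
    for i in keep.tolist():
        start, w = windows[i]
        for s, e, score in base_grounder(w, query_feat):
            proposals.append((start + s, start + e, score))
    return sorted(proposals, key=lambda p: -p[2])
```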
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
Lei, Jie, Berg, Tamara L., Bansal, Mohit
Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHighlights) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods. Lastly, we present several ablations and visualizations of Moment-DETR. Data and code are publicly available at https://github.com/jayleicn/moment_detr
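For readers unfamiliar with the "direct set prediction" framing, the training step it implies is a one-to-one Hungarian matching between predicted and ground-truth moments before any loss is computed. The sketch below is illustrative: the cost weights are placeholders, and the paper's full matching cost also includes an IoU-based term on the spans.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_moments(pred_spans, gt_spans, pred_logits, l1_w=1.0, cls_w=1.0):
    """Sketch of DETR-style one-to-one matching (illustrative weights and cost).

    pred_spans:  (Q, 2) predicted (center, width) spans in [0, 1].
    gt_spans:    (M, 2) ground-truth spans in the same format.
    pred_logits: (Q, 2) foreground/background logits.
    """
    # L1 distance between every predicted span and every ground-truth span.
    cost_span = torch.cdist(pred_spans, gt_spans, p=1)               # (Q, M)
    # Classification cost: negative foreground probability for each prediction.
    cost_cls = -pred_logits.softmax(-1)[:, 0:1].expand_as(cost_span)
    cost = l1_w * cost_span + cls_w * cost_cls
    # Hungarian algorithm picks the cheapest one-to-one assignment.
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return list(zip(rows.tolist(), cols.tolist()))  # (pred_idx, gt_idx) pairs
```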