GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization

Huang, Jia-Hong, Murn, Luka, Mrak, Marta, Worring, Marcel

Apr-26-2021–arXiv.org Artificial Intelligence

Traditional video summarization methods generate fixed video representations regardless of user interest. Such methods therefore fall short of users' expectations in content search and exploration scenarios. Multi-modal video summarization is one way to address this problem. When it is used to support video exploration, a user-defined text-based query is one of the main drivers of video summary generation. Effective encoding of both the text-based query and the video is therefore important for the task. In this work, a new method is proposed that uses a specialized attention network and contextualized word representations to tackle this task. The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator. Evaluated on the existing multi-modal video summarization benchmark, the proposed model improves accuracy by 5.88% and F1-score by 4.06% compared with the state-of-the-art method.
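To make the idea of a query-driven summary concrete, the following is a minimal sketch and not the authors' architecture. It assumes the query and each frame have already been encoded as small vectors, uses plain scaled dot-product attention as a stand-in for the paper's attention networks, and scores a predicted frame selection against reference labels with accuracy and F1-score. All function names, the toy vectors, and the top-k selection rule are hypothetical choices made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_attention(query_vec, frame_vecs):
    """Scaled dot-product attention weights of one query over all frames."""
    scale = math.sqrt(len(query_vec))
    return softmax([dot(query_vec, f) / scale for f in frame_vecs])


def select_frames(weights, k):
    """Binary summary: 1 for the k highest-weighted frames, 0 otherwise."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(weights))]


def accuracy_and_f1(pred, gold):
    """Frame-level accuracy and F1-score for binary selections."""
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)
    correct = sum(1 for p, g in zip(pred, gold) if p == g)
    accuracy = correct / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy data (hypothetical): a 2-d query embedding and four frame embeddings.
query = [1.0, 0.0]
frames = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2], [0.1, 0.9]]
gold = [1, 0, 1, 0]  # reference frame-level relevance labels

weights = query_attention(query, frames)
pred = select_frames(weights, k=2)
acc, f1 = accuracy_and_f1(pred, gold)
```

In this toy setup the frames whose embeddings align with the query receive the largest attention weights, so changing the query changes which frames enter the summary. That is the behavior a fixed, query-independent summary cannot provide.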



Country:
- North America > United States
  - New York > New York County
    - New York City (0.04)
  - New Mexico > Bernalillo County
    - Albuquerque (0.04)
- Europe
  - United Kingdom > England
    - Greater London > London (0.04)
  - Netherlands > North Holland
    - Amsterdam (0.04)
- Asia
  - China (0.04)
  - Taiwan > Taiwan Province
    - Taipei (0.04)

Genre:
- Research Report
  - New Finding (0.34)
  - Promising Solution (0.34)

Industry:
- Leisure & Entertainment > Sports (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.85)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
