GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization
Huang, Jia-Hong, Murn, Luka, Mrak, Marta, Worring, Marcel
arXiv.org Artificial Intelligence
Traditional video summarization methods generate a fixed video representation regardless of user interest, so they fall short of users' expectations in content search and exploration scenarios. Multi-modal video summarization is one approach to this problem. When it is used to support video exploration, a user-defined text-based query is one of the main drivers of summary generation. Encoding both the text-based query and the video effectively is therefore important for the task. In this work, a new method is proposed that uses a specialized attention network and contextualized word representations to tackle it. The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator. Evaluated on an existing multi-modal video summarization benchmark, the proposed model improves on the state-of-the-art method by +5.88% in accuracy and +4.06% in F1-score.
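The reported accuracy and F1-score can be illustrated with a minimal sketch. It assumes, hypothetically, that a summary is represented as one binary label per frame (1 = frame kept in the summary) and that the metrics are computed frame by frame against a reference labelling. The function name `frame_level_scores` and the example labels are illustrative only and are not taken from the paper or its benchmark.

```python
def frame_level_scores(predicted, reference):
    """Compare two equal-length lists of binary frame labels (1 = frame selected).

    Assumption: frame-level binary labels; the benchmark's exact protocol may differ.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one frame is required")

    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # selected in both
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # selected only by the model
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # missed by the model
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)  # rejected in both

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical six-frame video: model output vs. reference summary.
    scores = frame_level_scores([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
    print(scores)
```

Accuracy counts every correctly labelled frame, while F1 only rewards agreement on the selected frames, which is why the two numbers are usually reported together when summaries keep a small fraction of the video.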
Apr-26-2021
- Country:
  - Asia
    - China (0.04)
    - Taiwan > Taiwan Province
      - Taipei (0.04)
  - Europe
    - Netherlands > North Holland
      - Amsterdam (0.04)
    - United Kingdom > England
      - Greater London > London (0.04)
  - North America > United States
    - New Mexico > Bernalillo County
      - Albuquerque (0.04)
    - New York > New York County
      - New York City (0.04)
- Genre:
  - Research Report
    - New Finding (0.34)
    - Promising Solution (0.34)
- Industry:
  - Leisure & Entertainment > Sports (0.46)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)
    - Natural Language
      - Chatbot (0.85)
      - Large Language Model (1.00)
    - Representation & Reasoning (1.00)
    - Vision (1.00)