Multimodal Keyless Attention Fusion for Video Classification

Long, Xiang (Tsinghua University) | Gan, Chuang (Tsinghua University) | Melo, Gerard de (Rutgers University) | Liu, Xiao (Baidu) | Li, Yandong (Baidu) | Li, Fu (Baidu) | Wen, Shilei (Baidu)

AAAI Conferences 

The problem of video classification is inherently sequential and multimodal, and deep neural models hence need to capture and aggregate the most pertinent signals for a given input video. We propose Keyless Attention as an elegant and efficient means to more effectively account for the sequential nature of the data. Moreover, comparing a variety of multimodal fusion methods, we find that Multimodal Keyless Attention Fusion is the most successful at discerning interactions between modalities. We experiment on four highly heterogeneous datasets, UCF101, ActivityNet, Kinetics, and YouTube-8M to validate our conclusion, and show that our approach achieves highly competitive results. Especially on large-scale data, our method has great advantages in efficiency and performance. Most remarkably, our best single model can achieve 77.0% in terms of the top-1 accuracy and 93.2% in terms of the top-5 accuracy on the Kinetics validation set, and achieve 82.2% in terms of GAP@20 on the official YouTube-8M test set.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found