COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox
arXiv.org Artificial Intelligence
Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy and model the interactions between different levels of granularity and between different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer, which learns the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss, which connects video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext.
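As a rough illustration of the third component, below is a minimal PyTorch sketch of a soft cross-modal cycle-consistency loss: each clip embedding is mapped to its soft nearest neighbor among the paired sentence embeddings and then cycled back to the clip sequence, and the loss penalizes how far the cycle lands from where it started. The function name, the `temperature` parameter, and this particular soft-nearest-neighbor formulation are illustrative assumptions, not the paper's exact loss; the authors' implementation is in the linked repository.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(clip_emb, sent_emb, temperature=1.0):
    """Hedged sketch of a soft cross-modal cycle-consistency loss.

    clip_emb: (n, d) clip embeddings of one video
    sent_emb: (m, d) sentence embeddings of the paired paragraph
    """
    # Clip -> sentence: soft nearest neighbor in sentence space.
    alpha = F.softmax(-torch.cdist(clip_emb, sent_emb) / temperature, dim=1)
    soft_sent = alpha @ sent_emb  # (n, d)

    # Sentence -> clip: soft location back in the clip sequence.
    beta = F.softmax(-torch.cdist(soft_sent, clip_emb) / temperature, dim=1)

    # Penalize the expected landing index deviating from the start index.
    idx = torch.arange(clip_emb.size(0), dtype=clip_emb.dtype)
    return F.mse_loss(beta @ idx, idx)
```

Using soft (softmax-weighted) nearest neighbors keeps both directions of the cycle differentiable, so the cross-modal alignment can be trained end-to-end alongside the rest of the model.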
Nov 1, 2020