COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
