Dual-Stream Transformer for Generic Event Boundary Captioning

Xin Gu, Hanhua Ye, Guang Chen, Yufei Wang, Libo Zhang, Longyin Wen

arXiv.org Artificial Intelligence 

GEBC requires the captioning model to have a comprehension of instantaneous status changes around the given video boundary, which makes it much more challenging than the conventional video captioning task. In this paper, a Dual-Stream Transformer with improvements on both video content encoding and caption generation is proposed: (1) we utilize three pre-trained models to extract the video features, and Faster-RCNN [9] is utilized to extract region-of-interest features from the given videos. Additionally, we utilize the "types of boundary" labels as the language-modality input to help the model generate more accurate descriptions for boundaries. In order to learn discriminative representations for video boundaries, the extracted multi-modal features are input into our especially designed Dual-Stream Transformer.
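The multi-modal input described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions, the boundary-type label set, and the simple concatenation fusion are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions for the three pre-trained video
# extractors (the paper does not specify them in this excerpt).
appearance_feat = rng.standard_normal(512)   # e.g. frame-level appearance features
motion_feat = rng.standard_normal(1024)      # e.g. action-recognition features
region_feat = rng.standard_normal(2048)      # Faster-RCNN region-of-interest features

# "Types of boundary" label used as the language-modality input,
# encoded here as a one-hot vector over a hypothetical label set.
BOUNDARY_TYPES = ["change_of_action", "change_of_subject",
                  "change_of_object", "change_of_environment"]

def encode_boundary_type(label: str) -> np.ndarray:
    vec = np.zeros(len(BOUNDARY_TYPES))
    vec[BOUNDARY_TYPES.index(label)] = 1.0
    return vec

type_feat = encode_boundary_type("change_of_action")

# Fuse all modalities into one vector that a transformer encoder
# would consume (concatenation is an assumed fusion strategy).
fused = np.concatenate([appearance_feat, motion_feat, region_feat, type_feat])
print(fused.shape)  # (3588,)
```

In practice each modality would be a token sequence fed to the transformer rather than a single pooled vector, but the sketch shows how the boundary-type label enters as an extra language-modality input alongside the visual features.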
