Dual-Stream Transformer for Generic Event Boundary Captioning
Gu, Xin, Ye, Hanhua, Chen, Guang, Wang, Yufei, Zhang, Libo, Wen, Longyin
arXiv.org Artificial Intelligence
GEBC requires the captioning model to have a comprehension of instantaneous status changes around the given video boundary, which makes it much more challenging than the conventional video captioning task. In this paper, a Dual-Stream Transformer with improvements on both video content encoding and caption generation is proposed: (1) We utilize three pre-trained models to extract the video features. Faster-RCNN [9] is utilized to extract regions of interest from the given videos. Additionally, we utilize the "types of boundary" labels as the language-modality input to help the model generate more accurate descriptions for boundaries. In order to learn discriminative representations for video boundaries, the extracted multi-modal features are input into our specially designed Dual-Stream Transformer.
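The abstract describes fusing multi-modal video features (including Faster-RCNN region features) with a "type of boundary" label embedding before they enter the Dual-Stream Transformer. A minimal NumPy sketch of one plausible dual-stream cross-attention fusion follows; all dimensions, feature sources, and the fusion order are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv):
    # Scaled dot-product attention: queries from one stream,
    # keys/values taken from the other stream.
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ kv

rng = np.random.default_rng(0)
d = 64  # hypothetical feature dimension
appearance = rng.standard_normal((8, d))     # stand-in for frame-level features (8 frames)
regions = rng.standard_normal((5, d))        # stand-in for Faster-RCNN region-of-interest features
boundary_type = rng.standard_normal((1, d))  # stand-in embedding of the "type of boundary" label

# Two streams attend to each other, then the fused tokens
# attend to the boundary-type token (language-modality input).
stream_a = cross_attention(appearance, regions)
stream_b = cross_attention(regions, appearance)
fused = np.concatenate([stream_a, stream_b], axis=0)
fused = fused + cross_attention(fused, boundary_type)

print(fused.shape)  # → (13, 64)
```

The residual add after attending to the boundary-type token is one common way to inject a conditioning signal without discarding the visual content; the actual paper may combine the modalities differently.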
Mar-24-2023
- Genre:
- Research Report (0.50)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Vision (0.95)