Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions

Chen, Jun, Zhu, Deyao, Haydarov, Kilichbek, Li, Xiang, Elhoseiny, Mohamed

arXiv.org Artificial Intelligence 

Video captioning aims to convey dynamic scenes from videos using natural language, facilitating the understanding of spatiotemporal information within our environment. Although there have been recent advances, generating detailed and enriched video descriptions continues to be a substantial challenge. In this work, we introduce Video ChatCaptioner, an approach for creating more comprehensive spatiotemporal video descriptions. Our method employs a ChatGPT model as a controller, specifically designed to select frames for posing video content-driven questions. Subsequently, BLIP-2 is utilized to answer these visual queries. This question-answer framework effectively uncovers intricate video details and shows promise as a method for enhancing video content. Following multiple conversational rounds, ChatGPT can summarize enriched video content based on previous conversations. Through the human evaluation experiments, we found that nearly 62.5% of participants agree that Video ChatCaptioner can cover more visual information compared to ground-truth captions.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found