Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions

Chen, Jun, Zhu, Deyao, Haydarov, Kilichbek, Li, Xiang, Elhoseiny, Mohamed

May-24-2023–arXiv.org Artificial Intelligence

Video captioning aims to convey dynamic scenes from videos using natural language, facilitating the understanding of spatiotemporal information within our environment. Although there have been recent advances, generating detailed and enriched video descriptions continues to be a substantial challenge. In this work, we introduce Video ChatCaptioner, an approach for creating more comprehensive spatiotemporal video descriptions. Our method employs a ChatGPT model as a controller, specifically designed to select frames for posing video content-driven questions. Subsequently, BLIP-2 is utilized to answer these visual queries. This question-answer framework effectively uncovers intricate video details and shows promise as a method for enhancing video content. Following multiple conversational rounds, ChatGPT can summarize enriched video content based on previous conversations. Through the human evaluation experiments, we found that nearly 62.5% of participants agree that Video ChatCaptioner can cover more visual information compared to ground-truth captions.

large language model, machine learning, video chat captioner, (15 more...)

arXiv.org Artificial Intelligence

May-24-2023

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.50)

Industry:
- Leisure & Entertainment (0.47)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found