Video Instruction Tuning With Synthetic Data
Zhang, Yuanhan, Wu, Jinming, Li, Wei, Li, Bo, Ma, Zejun, Liu, Ziwei, Li, Chunyuan
These sources offer a wide range of video data from different websites, viewpoints, and domains. The relationship between these ten selected video datasets and others is shown in Figure 1. The videos from these ten datasets build the video pool for further video selection. Notably, we use untrimmed videos from each source except for YouCook2 and Kinetics-700, since we believe that cutting videos into clips can break plot continuity, which is essential for understanding the videos. Based on this video pool, we aim to select dynamic videos. In Figure 1, we outline our criteria for selecting high-quality data. Our main method for identifying dynamic content uses PySceneDetect, which calculates the number of scenes in a video; we found that the scene count is a good indicator of video dynamism. Additionally, we designed a specific approach to exclude videos that mainly contain "slides."
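As a rough illustration of scene-count-based filtering, the sketch below uses PySceneDetect's Python API (`detect` with `ContentDetector`) to count scene cuts and flag a video as dynamic. The detection threshold and the minimum-scene cutoff are assumptions for illustration; the paper does not specify the exact values or the slide-exclusion step.

```python
from scenedetect import detect, ContentDetector


def count_scenes(video_path: str, threshold: float = 27.0) -> int:
    """Count scene cuts detected by PySceneDetect's ContentDetector.

    The threshold of 27.0 is the library default, not a value taken
    from the paper.
    """
    scene_list = detect(video_path, ContentDetector(threshold=threshold))
    return len(scene_list)


def is_dynamic(video_path: str, min_scenes: int = 2) -> bool:
    """Treat a video with at least `min_scenes` detected scenes as dynamic.

    `min_scenes` is a hypothetical cutoff used here only to show how a
    scene count could gate the selection of dynamic videos.
    """
    return count_scenes(video_path) >= min_scenes


if __name__ == "__main__":
    # Example: keep only videos judged dynamic (paths are placeholders).
    candidates = ["pool/video_0001.mp4", "pool/video_0002.mp4"]
    selected = [p for p in candidates if is_dynamic(p)]
    print(f"Selected {len(selected)} dynamic videos out of {len(candidates)}")
```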
arXiv.org Artificial Intelligence
Oct-4-2024