Video Instruction Tuning With Synthetic Data
Zhang, Yuanhan, Wu, Jinming, Li, Wei, Li, Bo, Ma, Zejun, Liu, Ziwei, Li, Chunyuan
These sources offer a wide range of video data from different websites, viewpoints, and domains. The relationship between these ten selected video datasets and others is shown in Figure 1. The videos from these ten datasets build the video pool for further video selection. Notably, we use untrimmed videos from each source except for YouCook2 and Kinetics-700, since we believe that cutting videos into clips can break plot continuity, which is essential for understanding the videos. Based on this video pool, we aim to select dynamic videos. In Figure 1, we outline our criteria for selecting high-quality data. Our main method for identifying dynamic content uses PySceneDetect, which calculates the number of scenes in a video; we found that the scene count is a good indicator of video dynamism. Additionally, we designed a specific approach to exclude videos that mainly contain "slides."
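As a rough illustration of scene-count-based filtering, the sketch below uses PySceneDetect's Python API (`detect` with `ContentDetector`) to count scene cuts and flag a video as dynamic. The detection threshold and the minimum-scene cutoff are assumptions for illustration; the paper does not specify the exact values or the slide-exclusion step.

```python
from scenedetect import detect, ContentDetector


def count_scenes(video_path: str, threshold: float = 27.0) -> int:
    """Count scene cuts detected by PySceneDetect's ContentDetector.

    The threshold of 27.0 is the library default, not a value taken
    from the paper.
    """
    scene_list = detect(video_path, ContentDetector(threshold=threshold))
    return len(scene_list)


def is_dynamic(video_path: str, min_scenes: int = 2) -> bool:
    """Treat a video with at least `min_scenes` detected scenes as dynamic.

    `min_scenes` is a hypothetical cutoff used here only to show how a
    scene count could gate the selection of dynamic videos.
    """
    return count_scenes(video_path) >= min_scenes


if __name__ == "__main__":
    # Example: keep only videos judged dynamic (paths are placeholders).
    candidates = ["pool/video_0001.mp4", "pool/video_0002.mp4"]
    selected = [p for p in candidates if is_dynamic(p)]
    print(f"Selected {len(selected)} dynamic videos out of {len(candidates)}")
```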
arXiv.org Artificial Intelligence
Oct-4-2024