SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
Yangliu Hu, Zikai Song, Na Feng, Yawei Luo, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang
Video-based Large Language Models (Video-LLMs) have seen substantial advancements in recent years, propelled by progress in multi-modal LLMs. Although these models are proficient at providing overall descriptions of videos, they struggle with fine-grained understanding, particularly with visual dynamics and queries about video details. To tackle these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding abilities. Hence we propose two key contributions: (1) Self-Supervised Fragment Fine-Tuning (SF2T), a novel, effortless fine-tuning method that employs the rich inherent characteristics of videos for training while unlocking more fine-grained understanding abilities in Video-LLMs. Moreover, it relieves researchers from labor-intensive annotation and circumvents the limitations of natural language, which often fails to capture the complex spatiotemporal variations in videos; (2) a novel benchmark dataset, FineVidBench, for rigorously assessing Video-LLMs' performance at both the scene and fragment levels, offering a comprehensive evaluation of their capabilities. We assessed multiple models and validated the effectiveness of SF2T on them. Experimental results reveal that our approach improves their ability to capture and interpret spatiotemporal details.

1. Introduction

Large Language Models (LLMs) have showcased significant emergent capabilities, such as in-context learning [19], instruction following [23], and chain-of-thought reasoning [30], driven by expansive datasets and advanced model architectures.

[Figure: Performance w/ and w/o SF2T. We evaluated four advanced Video-LLMs w/ and w/o SF2T on our proposed FineVidBench against two baselines: (1) Base, performance without any fine-tuning (blue dashed); and (2) Base (SFT), performance with supervised fine-tuning (red dashed). After applying SF2T, all models showed significant improvements (solid blue and red), underscoring its broad effectiveness.]

Various Video-LLMs, exemplified by GPT4-V, VideoLLaMA 2 [4], MiniCPM-V [34], and Qwen2-VL [28], have been crafted by leading corporations and research institutions, demonstrating proficiency in capturing the overarching content of videos.
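To make the notion of a self-supervised fragment task concrete, below is a minimal sketch in the spirit of the abstract: it derives a temporal-ordering question directly from an unlabeled video, so the supervision signal comes for free from the sampling procedure. The task design, the names `make_order_task` and `FragmentSample`, and the prompt wording are illustrative assumptions, not the paper's published implementation.

```python
# Hypothetical sketch of one self-supervised fragment task (temporal
# ordering). The answer is derived from how the fragments were sampled,
# so no human annotation is required.
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FragmentSample:
    """One training example: shuffled fragment spans plus a text target."""
    fragment_spans: List[Tuple[float, float]]  # (start_sec, end_sec) per fragment
    question: str
    answer: str


def make_order_task(video_duration: float,
                    num_fragments: int = 3,
                    fragment_len: float = 2.0,
                    rng: random.Random = None) -> FragmentSample:
    if rng is None:
        rng = random.Random(0)

    # Split the video into equal bins and sample one fragment per bin,
    # so the fragments are non-overlapping and chronologically ordered.
    bin_width = video_duration / num_fragments
    assert fragment_len <= bin_width, "fragments must fit inside their bins"
    spans = []
    for k in range(num_fragments):
        start = rng.uniform(k * bin_width, (k + 1) * bin_width - fragment_len)
        spans.append((start, start + fragment_len))

    # Shuffle the fragments; the true chronology is free supervision.
    order = list(range(num_fragments))
    rng.shuffle(order)
    shuffled = [spans[i] for i in order]

    # For each chronological position i, report which shuffled fragment
    # (1-based) holds it.
    answer = " ".join(str(order.index(i) + 1) for i in range(num_fragments))
    question = ("These video fragments are shown out of order. "
                "List the fragment numbers in chronological order.")
    return FragmentSample(shuffled, question, answer)


sample = make_order_task(video_duration=60.0)
print(sample.fragment_spans)
print(sample.question, "->", sample.answer)
```

Because the target string is computed from the shuffle itself, such a task sidesteps both manual labeling and the difficulty of describing spatiotemporal changes in natural language, which is the property the abstract emphasizes.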
arXiv.org Artificial Intelligence
Apr-11-2025