VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection

Wang, Qiang, Gao, Xinyuan, Dong, SongLin, Han, Jizhou, Li, Jiangyang, He, Yuhang, Gong, Yihong

Nov-25-2025–arXiv.org Artificial Intelligence

We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs.
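The closed loop and the trajectory-to-preference conversion described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`parse_score`, `run_loop`, `to_preference_pair`), the JSON reply shape (`score`, `suggestion`), the step limit, the stopping threshold, and the best-versus-worst pairing rule are all assumptions made for illustration. The captioner, scorer, refiner, and reflection step are passed in as plain callables standing in for model calls, and the reflection step here ignores the chain-of-thought that the paper says it uses.

```python
import json
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, float]  # one trajectory entry: (caption, score)


def parse_score(raw: str) -> Optional[Tuple[float, str]]:
    """Parse a scorer reply, ASSUMED to be JSON like
    {"score": 7, "suggestion": "..."}; return None if it is malformed."""
    try:
        obj = json.loads(raw)
        return float(obj["score"]), str(obj.get("suggestion", ""))
    except (KeyError, TypeError, ValueError, AttributeError):
        # json.JSONDecodeError is a subclass of ValueError
        return None


def run_loop(
    video: str,
    caption_fn: Callable[[str, str], str],   # (video, prompt) -> caption
    score_fn: Callable[[str, str], str],     # (video, caption) -> raw JSON reply
    refine_fn: Callable[[str, str], str],    # (prompt, suggestion) -> new prompt
    reflect_fn: Callable[[str, str], str],   # (prompt, suggestion) -> amended prompt
    prompt: str,
    max_steps: int = 4,                      # hypothetical step limit
    target: float = 9.0,                     # hypothetical stopping score
) -> Optional[List[Step]]:
    """Generate, score, and refine; return the trajectory, or None when a
    scorer reply cannot be parsed (the sample is then filtered out)."""
    trajectory: List[Step] = []
    prev_score: Optional[float] = None
    for _ in range(max_steps):
        caption = caption_fn(video, prompt)
        parsed = parse_score(score_fn(video, caption))
        if parsed is None:
            return None
        score, suggestion = parsed
        trajectory.append((caption, score))
        if score >= target:
            break
        if prev_score is not None and score < prev_score:
            # Quality regressed: take the self-reflection path.
            prompt = reflect_fn(prompt, suggestion)
        else:
            prompt = refine_fn(prompt, suggestion)
        prev_score = score
    return trajectory


def to_preference_pair(trajectory: Optional[List[Step]]) -> Optional[Dict[str, str]]:
    """ASSUMED pairing rule: highest-scoring caption is 'chosen',
    lowest-scoring is 'rejected'; skip trajectories with no score gap."""
    if not trajectory or len(trajectory) < 2:
        return None
    best = max(trajectory, key=lambda s: s[1])
    worst = min(trajectory, key=lambda s: s[1])
    if best[1] == worst[1]:
        return None
    return {"chosen": best[0], "rejected": worst[0]}
```

In practice each callable would wrap a model call; any trajectory for which `run_loop` returns `None` is dropped before pairing, mirroring the JSON-error filtering step the abstract mentions.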

caption, large language model, machine learning

Country:
- Asia > China (0.46)

Genre:
- Research Report (0.50)
- Overview (0.46)

Technology:
- Information Technology
  - Communications (0.93)
  - Artificial Intelligence
    - Natural Language > Large Language Model (1.00)
    - Representation & Reasoning (0.93)
    - Machine Learning > Neural Networks
      - Deep Learning (0.47)
