VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection
Wang, Qiang, Gao, Xinyuan, Dong, SongLin, Han, Jizhou, Li, Jiangyang, He, Yuhang, Gong, Yihong
arXiv.org Artificial Intelligence
We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs.
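The step from (caption, score) trajectories to preference tuples can be illustrated with a minimal sketch. The abstract does not give the pairing rule or the scorer's output schema, so everything specific here is an assumption made for illustration: the JSON field name `score`, the helper names `parse_scorer_output`, `trajectory_to_preference`, and `build_pairs`, and the choice to pair the highest-scoring caption (chosen) with the lowest-scoring one (rejected). It is not the authors' implementation.

```python
import json
from typing import List, Optional, Tuple

# One trajectory = the (caption, raw scorer reply) steps produced for one video.
RawTrajectory = List[Tuple[str, str]]


def parse_scorer_output(raw: str) -> Optional[float]:
    """Return the score from a JSON scorer reply, or None if it cannot be parsed.

    ASSUMPTION: the scorer replies with a JSON object holding a "score" field.
    """
    try:
        return float(json.loads(raw)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None


def trajectory_to_preference(
    trajectory: List[Tuple[str, float]]
) -> Optional[Tuple[str, str]]:
    """Turn one scored trajectory into a (chosen, rejected) caption pair.

    ASSUMPTION: chosen = highest score, rejected = lowest score.
    Trajectories with fewer than two steps or with tied scores yield no pair.
    """
    if len(trajectory) < 2:
        return None
    chosen = max(trajectory, key=lambda step: step[1])
    rejected = min(trajectory, key=lambda step: step[1])
    if chosen[1] == rejected[1]:
        return None
    return chosen[0], rejected[0]


def build_pairs(raw_trajectories: List[RawTrajectory]) -> List[Tuple[str, str]]:
    """Build preference pairs, dropping any sample with a JSON parsing error."""
    pairs = []
    for steps in raw_trajectories:
        scored = []
        for caption, raw in steps:
            score = parse_scorer_output(raw)
            if score is None:
                break  # a parsing error discards the whole sample
            scored.append((caption, score))
        else:
            pair = trajectory_to_preference(scored)
            if pair is not None:
                pairs.append(pair)
    return pairs
```

Under these assumptions, a trajectory whose scorer replies all parse contributes at most one pair, and a single malformed reply removes the whole sample, which mirrors the filtering step described above.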
Nov-25-2025