LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
Neural Information Processing Systems
We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision.
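The Stage 1 construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Clip` type, the `build_sequence` helper, the cosine-similarity test, the `max_sim` threshold, and the random anchor placement are all assumptions chosen for illustration. The sketch drops distractor clips that look too similar to the anchored clip, then inserts the anchor at a random position so that its location carries no signal. Question-specificity filtering and the construction of the preferred and dispreferred responses are omitted.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Clip:
    clip_id: str
    embedding: Sequence[float]  # hypothetical visual feature vector


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_sequence(
    anchor: Clip,
    candidates: List[Clip],
    max_sim: float = 0.8,          # assumed threshold, not from the paper
    num_distractors: int = 3,      # assumed count, not from the paper
    rng: Optional[random.Random] = None,
) -> Tuple[List[Clip], int]:
    """Interleave the anchor clip with visually dissimilar distractors.

    Returns the clip sequence and the index at which the anchor was placed.
    """
    if rng is None:
        rng = random.Random(0)
    # Visual-similarity filter: keep only distractors unlike the anchor,
    # so the question anchored to this clip stays unambiguous.
    distractors = [
        c for c in candidates
        if cosine(anchor.embedding, c.embedding) < max_sim
    ][:num_distractors]
    # Random placement of the anchor, intended to reduce positional bias.
    position = rng.randrange(len(distractors) + 1)
    sequence = distractors[:position] + [anchor] + distractors[position:]
    return sequence, position


anchor = Clip("anchor", [1.0, 0.0])
pool = [
    Clip("near_duplicate", [1.0, 0.01]),  # filtered out: too similar
    Clip("kitchen", [0.0, 1.0]),
    Clip("street", [-1.0, 0.0]),
]
sequence, position = build_sequence(anchor, pool)
print([c.clip_id for c in sequence], position)
```

In this sketch, a question written for the anchor clip would then be paired with the full interleaved sequence, so the model has to locate the relevant clip among the distractors before answering.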
Jun-23-2026, 00:43:31 GMT