Gong, Sitong
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
Gong, Sitong, Zhuge, Yunzhi, Zhang, Lu, Yang, Zongxin, Zhang, Pingping, Lu, Huchuan
Existing methods for Video Reasoning Segmentation rely heavily on a single special token to represent the object in the keyframe or the entire video, inadequately capturing spatial complexity and inter-frame motion. To overcome these challenges, we propose VRS-HQ, an end-to-end video reasoning segmentation approach that leverages Multimodal Large Language Models (MLLMs) to inject rich spatiotemporal features into hierarchical tokens. Our key innovations include a Temporal Dynamic Aggregation (TDA) and a Token-driven Keyframe Selection (TKS). Specifically, we design frame-level