Tao, Wenbing
Unhackable Temporal Rewarding for Scalable Video MLLMs
Yu, En, Lin, Kangheng, Zhao, Liang, Wei, Yana, Zhu, Zining, Wei, Haoran, Sun, Jianjian, Ge, Zheng, Zhang, Xiangyu, Wang, Jingyu, Tao, Wenbing
In the pursuit of superior video-processing MLLMs, we have encountered a perplexing paradox: the "anti-scaling law", where more data and larger models lead to worse performance. This study unmasks the culprit: "temporal hacking", a phenomenon where models shortcut by fixating on select frames, missing the full video narrative. In this work, we systematically establish a comprehensive theory of temporal hacking, defining it from a reinforcement learning perspective, introducing the Temporal Perplexity (TPL) score to assess this misalignment, and proposing the Unhackable Temporal Rewarding (UTR) framework to mitigate the temporal hacking. Both theoretically and empirically, TPL proves to be a reliable indicator of temporal modeling quality, correlating strongly with frame activation patterns. Extensive experiments reveal that UTR not only counters temporal hacking but significantly elevates video comprehension capabilities. This work not only advances video-AI systems but also illuminates the critical importance of aligning proxy rewards with true objectives in MLLM development.
Cross-View Referring Multi-Object Tracking
Chen, Sijia, Yu, En, Tao, Wenbing
Referring Multi-Object Tracking (RMOT) is an important topic in the current tracking field. Its task form is to guide the tracker to track objects that match the language description. Current research mainly focuses on referring multi-object tracking under single-view, which refers to a view sequence or multiple unrelated view sequences. However, in the single-view, some appearances of objects are easily invisible, resulting in incorrect matching of objects with the language description. In this work, we propose a new task, called Cross-view Referring Multi-Object Tracking (CRMOT). It introduces the cross-view to obtain the appearances of objects from multiple views, avoiding the problem of the invisible appearances of objects in RMOT task. CRMOT is a more challenging task of accurately tracking the objects that match the language description and maintaining the identity consistency of objects in each cross-view. To advance CRMOT task, we construct a cross-view referring multi-object tracking benchmark based on CAMPUS and DIVOTrack datasets, named CRTrack. Specifically, it provides 13 different scenes and 221 language descriptions. Furthermore, we propose an end-to-end cross-view referring multi-object tracking method, named CRTracker. Extensive experiments on the CRTrack benchmark verify the effectiveness of our method. The dataset and code are available at https://github.com/chen-si-jia/CRMOT.
PG-NeuS: Robust and Efficient Point Guidance for Multi-View Neural Surface Reconstruction
Zhang, Chen, Su, Wanjuan, Xu, Qingshan, Tao, Wenbing
Recently, learning multi-view neural surface reconstruction with the supervision of point clouds or depth maps has been a promising way. However, due to the underutilization of prior information, current methods still struggle with the challenges of limited accuracy and excessive time complexity. In addition, prior data perturbation is also an important but rarely considered issue. To address these challenges, we propose a novel point-guided method named PG-NeuS, which achieves accurate and efficient reconstruction while robustly coping with point noise. Specifically, aleatoric uncertainty of the point cloud is modeled to capture the distribution of noise, leading to noise robustness. Furthermore, a Neural Projection module connecting points and images is proposed to add geometric constraints to implicit surface, achieving precise point guidance. To better compensate for geometric bias between volume rendering and point modeling, high-fidelity points are filtered into a Bias Network to further improve details representation. Benefiting from the effective point guidance, even with a lightweight network, the proposed PG-NeuS achieves fast convergence with an impressive 11x speedup compared to NeuS. Extensive experiments show that our method yields high-quality surfaces with high efficiency, especially for fine-grained details and smooth regions, outperforming the state-of-the-art methods. Moreover, it exhibits strong robustness to noisy data and sparse data.