UASTrack: A Unified Adaptive Selection Framework with Modality-Customization in Single Object Tracking
Wang, He, Xu, Tianyang, Tang, Zhangyong, Wu, Xiao-Jun, Kittler, Josef
arXiv.org Artificial Intelligence
Abstract--Multi-modal tracking is essential in single-object tracking (SOT), as different sensor types contribute unique capabilities for overcoming challenges caused by variations in object appearance. However, existing unified RGB-X trackers (where X denotes the depth, event, or thermal modality) either rely on task-specific training strategies for individual RGB-X image pairs or fail to address the critical importance of modality-adaptive perception in real-world applications. In this work, we propose UASTrack, a unified adaptive selection framework that facilitates both model and parameter unification, as well as adaptive modality discrimination, across various multi-modal tracking tasks. To achieve modality-adaptive perception on joint RGB-X pairs, we design a Discriminative Auto-Selector (DAS) capable of identifying modality labels, thereby distinguishing the data distributions of the auxiliary modalities. Furthermore, we propose a Task-Customized Optimization Adapter (TCOA) tailored to the various modalities in the latent space. This strategy effectively filters out redundant noise and mitigates background interference based on the specific characteristics of each modality. Extensive comparisons on five benchmarks (LasHeR, GTOT, RGBT234, VisEvent, and DepthTrack), covering RGB-T, RGB-E, and RGB-D tracking scenarios, demonstrate that our approach achieves competitive performance while introducing only 1.87M additional training parameters and 1.95G FLOPs. The code will be available at https://github.com/wanghe/UASTrack.

Index Terms--Multi-modal object tracking, Unified multi-modal tracking tasks, Adaptive task recognition.

Visual object tracking [1]-[4] is a crucial research area in computer vision, focusing on estimating the position and size of an object throughout a video sequence, starting from the object's initial state in the first frame.
Recent advancements highlight the limitations of relying solely on visible sensors, leading to increased interest in utilizing auxiliary modalities such as thermal (T) [5], event (E) [6], and depth (D) [7].

He Wang, Tianyang Xu, Zhangyong Tang, Shaochuan Zhao, and Xiao-Jun Wu (corresponding author) are with the School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China (e-mail: 7243115005@stu.jiangnan.edu.cn;
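The abstract describes the DAS only at a high level: a component that predicts which auxiliary modality (depth, event, or thermal) accompanies the RGB stream, so the tracker can route the pair through modality-specific processing. As a loose illustration only (the class name, feature pooling, and linear classifier below are assumptions for the sketch, not the paper's actual architecture), such a selector can be pictured as a lightweight classifier over pooled auxiliary-frame features:

```python
import numpy as np

# Illustrative sketch of a DAS-style modality selector. Everything here
# (names, pooling choice, linear head) is a hypothetical stand-in, not
# the UASTrack implementation.

MODALITIES = ["depth", "event", "thermal"]

class ModalitySelectorSketch:
    """Predicts an auxiliary-modality label from a C x H x W frame."""

    def __init__(self, feat_dim: int, n_classes: int = 3, seed: int = 0):
        rng = np.random.default_rng(seed)
        # A trained model would learn these weights; random init here.
        self.W = rng.standard_normal((feat_dim, n_classes)) * 0.01
        self.b = np.zeros(n_classes)

    def pool(self, aux_frame: np.ndarray) -> np.ndarray:
        # Global average pooling over spatial dims -> per-channel features.
        return aux_frame.mean(axis=(1, 2))

    def predict(self, aux_frame: np.ndarray) -> str:
        logits = self.pool(aux_frame) @ self.W + self.b
        return MODALITIES[int(np.argmax(logits))]

selector = ModalitySelectorSketch(feat_dim=3)
frame = np.random.default_rng(1).random((3, 32, 32))  # mock auxiliary input
label = selector.predict(frame)
```

Once a label is predicted, a unified tracker could dispatch the RGB-X pair to the matching task-customized branch; that routing step is the point of making modality perception adaptive rather than fixed per task.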
Feb-25-2025