Goto

Collaborating Authors

 Diao, Wenhui


RingMo-Aerial: An Aerial Remote Sensing Foundation Model With A Affine Transformation Contrastive Learning

arXiv.org Artificial Intelligence

Aerial Remote Sensing (ARS) vision tasks pose significant challenges due to the unique characteristics of their viewing angles. Existing research has primarily focused on algorithms for specific tasks, which have limited applicability in a broad range of ARS vision applications. This paper proposes the RingMo-Aerial model, aiming to fill the gap in foundation model research in the field of ARS vision. By introducing the Frequency-Enhanced Multi-Head Self-Attention (FE-MSA) mechanism and an affine transformation-based contrastive learning pre-training method, the model's detection capability for small targets is enhanced and optimized for the tilted viewing angles characteristic of ARS. Furthermore, the ARS-Adapter, an efficient parameter fine-tuning method, is proposed to improve the model's adaptability and effectiveness in various ARS vision tasks. Experimental results demonstrate that RingMo-Aerial achieves SOTA performance on multiple downstream tasks. This indicates the practicality and effectiveness of RingMo-Aerial in enhancing the performance of ARS vision tasks.


SiamTHN: Siamese Target Highlight Network for Visual Tracking

arXiv.org Artificial Intelligence

Siamese network based trackers develop rapidly in the field of visual object tracking in recent years. The majority of siamese network based trackers now in use treat each channel in the feature maps generated by the backbone network equally, making the similarity response map sensitive to background influence and hence challenging to focus on the target region. Additionally, there are no structural links between the classification and regression branches in these trackers, and the two branches are optimized separately during training. Therefore, there is a misalignment between the classification and regression branches, which results in less accurate tracking results. In this paper, a Target Highlight Module is proposed to help the generated similarity response maps to be more focused on the target region. To reduce the misalignment and produce more precise tracking results, we propose a corrective loss to train the model. The two branches of the model are jointly tuned with the use of corrective loss to produce more reliable prediction results. Experiments on 5 challenging benchmark datasets reveal that the method outperforms current models in terms of performance, and runs at 38 fps, proving its effectiveness and efficiency.