Appendices for WebUOT-1M: Advancing Deep Underwater Object Tracking with A Million-Scale Benchmark

Neural Information Processing Systems 

Due to space constraints, many details were omitted from the main text. Here, we present more details about our dataset and method, together with further discussions and experimental results, as follows. We present the potential social impacts of our work. We discuss the limitations of our work. We offer more statistical results and dataset splits of WebUOT-1M. We present the definitions and distributions of the 23 tracking attributes. We present more details about the proposed MATP module. We provide extensive discussions and analyses of the sample imbalance, the inference settings, the role of open-air domain knowledge, the advantages of the OKTrack method, and the differences between our work and existing works. We report the error ranges, the results on UVOT400, and the attribute-based performance on WebUOT-1M.

The proposed WebUOT-1M dataset can promote research on UOT, which is beneficial for underwater vision understanding, marine environmental monitoring, marine animal conservation, etc. Although we made our best effort to collect as many target categories as possible, underwater targets in the real world are vastly diverse, so it remains an open question whether models trained on WebUOT-1M generalize well to unseen, rare underwater targets. The constructed WebUOT-1M dataset is made available under Creative Commons 4.0 licenses.

One limitation of the proposed method is that it relies on the ViT backbone, which is inherently constrained by the quadratic computational complexity of the self-attention mechanism [13].
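To make the quadratic cost concrete, the following minimal sketch counts the entries of the N x N attention matrix that a ViT builds over its N patch tokens. The image sizes and the 16-pixel patch size are illustrative assumptions, not the configuration used by OKTrack, and the helper names are hypothetical.

```python
def num_patch_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image.

    Both arguments are illustrative assumptions, not OKTrack settings.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def attention_matrix_entries(num_tokens: int) -> int:
    """Entries in the N x N self-attention matrix (per head, per layer)."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    for size in (224, 448):
        n = num_patch_tokens(size)
        print(f"{size}x{size} image: {n} tokens, "
              f"{attention_matrix_entries(n)} attention entries")
```

Under these assumptions, doubling the image side length quadruples the number of tokens and increases the number of attention entries sixteenfold, which is why higher-resolution inputs quickly become expensive for ViT-based trackers.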