A Multi-modal Global Instance Tracking Benchmark (MGIT): Better Locating Target in Complex Spatio-temporal and Causal Relationship

Neural Information Processing Systems 

MGIT consists of 150 long video sequences with a total of 2.03 million frames,