LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan 1,3, Michael Tschannen 1, Yongqin Xian

Neural Information Processing Systems 

Specifically, LocCa employs two tasks, bounding box prediction and location-dependent captioning, conditioned on the image pixel input.
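
To make the two tasks concrete, the following is a minimal sketch of how one region annotation (a box plus its caption) could be turned into a training pair for each task. It is illustrative only and not the authors' implementation. The task prefixes `[box]` and `[caption]`, the bin count `NUM_BINS`, the (ymin, xmin, ymax, xmax) coordinate order, the choice to condition box prediction on the region caption, and all function names are assumptions introduced here. The image pixels, which condition both tasks, are omitted from the sketch.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed quantization resolution; the source text does not specify one.
NUM_BINS = 1000

Box = Tuple[float, float, float, float]


@dataclass
class Region:
    """One annotated region (hypothetical container for this sketch)."""
    box: Box        # assumed order (ymin, xmin, ymax, xmax), normalized to [0, 1]
    caption: str    # text describing the region


def quantize_box(box: Box, num_bins: int = NUM_BINS) -> Tuple[int, ...]:
    """Map normalized coordinates to integer bins in [0, num_bins - 1]."""
    return tuple(min(num_bins - 1, max(0, int(c * num_bins))) for c in box)


def box_to_text(box: Box, num_bins: int = NUM_BINS) -> str:
    """Serialize a box as space-separated integer tokens."""
    return " ".join(str(b) for b in quantize_box(box, num_bins))


def make_box_prediction_example(region: Region) -> Tuple[str, str]:
    """Bounding box prediction: the prompt carries the caption, the target is the box."""
    prompt = f"[box] {region.caption}"      # hypothetical task prefix
    target = box_to_text(region.box)
    return prompt, target


def make_location_captioning_example(region: Region) -> Tuple[str, str]:
    """Location-dependent captioning: the prompt carries the box, the target is the caption."""
    prompt = f"[caption] {box_to_text(region.box)}"  # hypothetical task prefix
    target = region.caption
    return prompt, target


if __name__ == "__main__":
    region = Region(box=(0.125, 0.25, 0.5, 0.75), caption="a brown dog")
    print(make_box_prediction_example(region))
    print(make_location_captioning_example(region))
```

In this sketch the two tasks are mirror images of each other: the same annotation supplies the prompt for one and the target for the other, so a single decoder could be trained on both by switching the prefix.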