Video OWL-ViT: Temporally-consistent open-world localization in video
Heigold, Georg, Minderer, Matthias, Gritsenko, Alexey, Bewley, Alex, Keysers, Daniel, Lučić, Mario, Yu, Fisher, Kipf, Thomas
arXiv.org Artificial Intelligence
We present an architecture and a training recipe that adapt pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pre-training, can be transferred successfully to open-world localization across diverse videos.
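The recurrent query-propagation scheme described in the abstract can be sketched compactly. The snippet below is a minimal illustration, not the authors' code: it assumes a JAX/Flax setting (OWL-ViT is a JAX model), a single-layer cross-attention decoder, and illustrative module names and layer sizes. Object queries attend to each frame's backbone features, and the decoder's output tokens for frame t are reused as the object queries for frame t+1, which is what links detections across time.

```python
# Minimal sketch (assumptions, not the authors' implementation) of recurrent
# query propagation: decoder output tokens for one frame become the object
# queries for the next frame.
import jax
import jax.numpy as jnp
import flax.linen as nn


class FrameDecoder(nn.Module):
    """One cross-attention decoder step over a single frame's features."""
    dim: int = 256
    num_heads: int = 8

    @nn.compact
    def __call__(self, queries, frame_features):
        # Object queries attend to the frame's image tokens.
        attended = nn.MultiHeadDotProductAttention(num_heads=self.num_heads)(
            queries, frame_features)
        x = nn.LayerNorm()(queries + attended)
        # Standard transformer feed-forward block with a residual connection.
        y = nn.Dense(self.dim)(nn.relu(nn.Dense(4 * self.dim)(x)))
        return nn.LayerNorm()(x + y)


class RecurrentVideoDecoder(nn.Module):
    """Propagates object representations through time, frame by frame."""
    num_queries: int = 100
    dim: int = 256

    @nn.compact
    def __call__(self, video_features):
        # video_features: [num_frames, num_tokens, dim] from the image backbone.
        decoder = FrameDecoder(dim=self.dim)
        queries = self.param('init_queries', nn.initializers.normal(0.02),
                             (self.num_queries, self.dim))
        outputs = []
        for frame_features in video_features:
            # Output tokens for this frame become the queries for the next,
            # tying object identities together across time.
            queries = decoder(queries, frame_features)
            outputs.append(queries)
        return jnp.stack(outputs)  # [num_frames, num_queries, dim]


# Tiny usage example on random features standing in for backbone outputs.
model = RecurrentVideoDecoder()
dummy_video = jnp.zeros((4, 196, 256))  # 4 frames, 196 image tokens each
params = model.init(jax.random.PRNGKey(0), dummy_video)
tracks = model.apply(params, dummy_video)  # [4, 100, 256]
```

In this sketch the per-object output tokens would be fed to the detection heads of the backbone detector at every frame; because the same query slot is carried forward, the same slot tends to track the same object, giving the temporal consistency the abstract refers to.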
Aug-21-2023
- Genre:
- Research Report (0.82)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning
- Neural Networks (0.47)
- Statistical Learning (0.46)
- Natural Language (1.00)
- Representation & Reasoning (1.00)
- Vision > Video Understanding (0.48)