Video OWL-ViT: Temporally-consistent open-world localization in video