OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Neural Information Processing Systems

This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and thus can perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language to help video-language). To this end, we propose a decoupled joint pretraining of image-language and video-language to effectively decompose vision-language modeling into spatial and temporal dimensions and obtain a performance boost on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.
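The UniVLC loss described above pools image-text, video-text, and label data into one contrastive objective. As a rough illustration of the cross-modal alignment part only, here is a minimal symmetric InfoNCE-style sketch in PyTorch; it is not the paper's exact UniVLC formulation (which additionally merges label data by treating class names as captions), and the function name and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def univlc_style_loss(visual_emb: torch.Tensor,
                      text_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over matched visual/text embedding pairs.

    visual_emb, text_emb: (batch, dim) tensors where row i of each is a
    matched pair (an image or video clip and its caption/label text).
    """
    # Normalize so the dot product is cosine similarity.
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix scaled by temperature.
    logits = v @ t.T / temperature
    # The diagonal holds the positive (matched) pairs.
    targets = torch.arange(v.size(0), device=logits.device)
    # Average the vision-to-text and text-to-vision directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```

Because image-text and video-text pairs share the same embedding space in this formulation, both data types can be mixed in one batch without changing the loss.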



OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Neural Information Processing Systems

Our setup is based on the following considerations. The default settings for finetuning on each dataset are shown in Table 1.

Table 1: End-to-end finetuning configurations for image-language downstream tasks.

Config                   COCO (retrieval) & Flickr30k   COCO (captioning)   VQA
optimizer                AdamW                          AdamW               AdamW
base learning rate       1e-5                           1e-5                2e-5
weight decay             0.05                           0.05                0.05
learning rate schedule   linear decay                   linear decay        linear decay
batch size               512                            512                 256
training epochs          10                             10                  10

C.2 Video-Language Tasks

We demonstrate more comparison results using different pretraining paradigms (i.e., image-only, …). Details of the pretraining data can be found in Table 4. The "img2vid" strategy is also adopted for further comparison, where we start with image-only pretraining. We can see that the captions generated by OmniVL are both natural and abundant. OmniVL can generate more fine-grained descriptions (line 1).

Figure 4: Some video captions generated by OmniVL.
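The finetuning configurations above (AdamW, linear learning-rate decay, fixed weight decay) can be sketched as a PyTorch setup. This is a minimal illustration assuming a per-step linear decay to zero; the helper name `build_finetune_optimizer` and the `steps_per_epoch` parameter are illustrative, not from the paper.

```python
import torch

def build_finetune_optimizer(model: torch.nn.Module,
                             base_lr: float = 1e-5,
                             weight_decay: float = 0.05,
                             epochs: int = 10,
                             steps_per_epoch: int = 1000):
    """Build an AdamW optimizer with a linear-decay schedule.

    Defaults mirror the COCO retrieval / Flickr30k row of Table 1;
    for VQA, pass base_lr=2e-5.
    """
    optimizer = torch.optim.AdamW(model.parameters(),
                                  lr=base_lr,
                                  weight_decay=weight_decay)
    total_steps = epochs * steps_per_epoch
    # Multiply the base LR by a factor that decays linearly from 1 to 0.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps))
    return optimizer, scheduler
```

In a training loop, `scheduler.step()` would be called once per optimization step so the learning rate reaches zero at the end of the final epoch.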

