Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video

Mengyao Xu, Wenfei Zhou, Yauhen Babakhin, Gabriel Moreira, Ronay Ak, Radek Osmulski, Bo Liu, Even Oldridge, Benedikt Schifferer

arXiv.org Artificial Intelligence 

We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding model developed to handle the increasing complexity of real-world information needs. While Retrieval-Augmented Generation (RAG) has significantly advanced language models by incorporating external knowledge, existing text-based retrievers rely on clean, structured input and struggle with the visually and semantically rich content found in real-world documents such as PDFs, slides, or videos. Recent work such as ColPali has shown that preserving document layout using image-based representations can improve retrieval quality. Building on this, and inspired by the capabilities of recent multimodal models such as Qwen2.5-Omni, we extend retrieval beyond text and images to also support audio and video modalities. Omni-Embed-Nemotron enables both cross-modal (e.g., text → video) and joint-modal (e.g., text → video+audio) retrieval using a single model. We describe the architecture, training setup, and evaluation results of Omni-Embed-Nemotron, and demonstrate its effectiveness in text, image, and video retrieval.
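In a unified setup like this, one encoder maps queries and candidates of any supported modality into a shared vector space, so retrieval reduces to nearest-neighbor search over that space. The sketch below illustrates the ranking step with cosine similarity in PyTorch; the `encode_text` / `encode_video` calls referenced in the comments are hypothetical placeholders, since the abstract does not specify the model's API.

```python
# Minimal sketch of retrieval in a shared multimodal embedding space.
# The encoder calls mentioned in comments (encode_text, encode_video) are
# hypothetical: the abstract does not describe Omni-Embed-Nemotron's API.
import torch
import torch.nn.functional as F


def rank_by_similarity(query_emb: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Rank candidates by cosine similarity to the query.

    query_emb: (d,) embedding of a text query.
    doc_embs:  (n, d) embeddings of candidates (text, image, audio, video,
               or video+audio), all produced by the same unified model so
               they live in one shared space.
    Returns candidate indices sorted from most to least similar.
    """
    query = F.normalize(query_emb, dim=-1)
    docs = F.normalize(doc_embs, dim=-1)
    scores = docs @ query  # cosine similarity, shape (n,)
    return torch.argsort(scores, descending=True)


# Toy usage with random vectors standing in for real embeddings,
# e.g. query_emb = encode_text("..."), doc_embs = stacked encode_video(clip).
d = 1024
query_emb = torch.randn(d)
doc_embs = torch.randn(5, d)
print(rank_by_similarity(query_emb, doc_embs))
```

Because every modality is embedded by one model, the same index and scoring function serve text → image, text → video, and text → video+audio retrieval alike.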