NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

Luo, Run, Xia, Xiaobo, Wang, Lu, Chen, Longze, Shan, Renke, Luo, Jing, Yang, Min, Chua, Tat-Seng

Oct-17-2025–arXiv.org Artificial Intelligence

Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks within unified frameworks separately, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.

arxiv preprint arxiv, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

Oct-17-2025

arXiv.org PDF

Add feedback

Country:
- Asia (0.28)

Genre:
- Research Report (1.00)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (0.94)
  - Artificial Intelligence
    - Vision (1.00)
    - Representation & Reasoning (1.00)
    - Cognitive Science (1.00)
    - Speech > Speech Recognition (0.67)
    - Natural Language
      - Large Language Model (1.00)
      - Chatbot (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found