Training Video Foundation Models with NVIDIA NeMo

Zeeshan Patel, Ethan He, Parth Mannan, Xiaowei Ren, Ryan Wolf, Niket Agarwal, Jacob Huffman, Zhuoyao Wang, Carl Wang, Jack Chang, Yan Bai, Tommy Huang, Linnan Wang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, Pallab Bhattacharya, David Page, Nima Tajbakhsh, Ashwath Aithal

arXiv.org Artificial Intelligence 

Video Foundation Models (VFMs) have recently been used to simulate the real world for training physical AI systems and to develop creative visual experiences. However, training large-scale VFMs that can generate high-quality videos remains challenging. We present a scalable, open-source VFM training pipeline built on NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.
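To make the abstract's "parallelized video diffusion model training" concrete, the sketch below shows a minimal denoising training step for video tensors run under data parallelism with plain PyTorch DistributedDataParallel. This is purely illustrative and is not the NeMo API described in the paper; the toy 3D-convolutional backbone, linear noise schedule, and random batch are placeholder assumptions.

```python
# Illustrative sketch only: data-parallel denoising training on video clips.
# The model, schedule, and data are placeholders, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyVideoDenoiser(nn.Module):
    """Placeholder 3D-conv network standing in for a video diffusion backbone."""

    def __init__(self, channels: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, channels, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(channels, 3, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # A real backbone would condition on the timestep t; omitted for brevity.
        return self.net(x)


def train_step(model, optimizer, clip, num_timesteps: int = 1000):
    """One DDPM-style step: noise a clip at a random timestep and regress the noise."""
    b = clip.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=clip.device)
    noise = torch.randn_like(clip)
    # Simple linear alpha-bar schedule, purely for illustration.
    alpha_bar = (1.0 - t.float() / num_timesteps).view(b, 1, 1, 1, 1)
    noisy = alpha_bar.sqrt() * clip + (1.0 - alpha_bar).sqrt() * noise
    loss = F.mse_loss(model(noisy, t), noise)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()  # DDP all-reduces gradients across data-parallel ranks here
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda")

    model = DDP(ToyVideoDenoiser().to(device), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Fake batch of video clips: (batch, channels, frames, height, width).
    clip = torch.randn(2, 3, 8, 64, 64, device=device)
    print(f"rank {dist.get_rank()} loss: {train_step(model, optimizer, clip):.4f}")

    dist.destroy_process_group()
```

Launched with torchrun, each rank processes its own shard of video clips while gradients are synchronized automatically; the paper's pipeline layers additional model-parallel strategies on top of this basic data-parallel pattern.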
