iVideoGPT: Interactive VideoGPTs are Scalable World Models

Wu, Jialong, Yin, Shaofeng, Feng, Ningya, He, Xu, Li, Dong, Hao, Jianye, Long, Mingsheng

Jun-2-2024–arXiv.org Artificial Intelligence

World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity makes it difficult to harness recent advances in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a sequence of tokens, so that agents interact with the model via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that can be adapted into interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications.
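
As a rough illustration of what it means to integrate observations, actions, and rewards into one token sequence trained by next-token prediction, the sketch below flattens a toy trajectory into a single list of integer tokens and builds shifted input/target pairs. The abstract does not specify the token layout, so everything here is an assumption made for illustration: the `Step` type, the vocabulary sizes, the discretization of actions and rewards into ids, and the per-step ordering (observation, then action, then reward). It should not be read as the layout iVideoGPT actually uses.

```python
from typing import List, NamedTuple, Tuple

# Hypothetical vocabulary sizes, chosen only to keep the example small.
OBS_VOCAB = 16   # number of discrete visual token ids
ACT_VOCAB = 4    # number of discretized action ids
REW_VOCAB = 3    # number of discretized reward bins

# Each modality gets a disjoint id range in one shared vocabulary.
ACT_OFFSET = OBS_VOCAB
REW_OFFSET = OBS_VOCAB + ACT_VOCAB


class Step(NamedTuple):
    """One time step of a toy trajectory (hypothetical structure)."""
    obs_tokens: List[int]  # visual tokens, each in [0, OBS_VOCAB)
    action: int            # action id in [0, ACT_VOCAB)
    reward: int            # reward bin in [0, REW_VOCAB)


def flatten_trajectory(steps: List[Step]) -> List[int]:
    """Interleave observation, action, and reward tokens step by step."""
    seq: List[int] = []
    for step in steps:
        seq.extend(step.obs_tokens)
        seq.append(ACT_OFFSET + step.action)
        seq.append(REW_OFFSET + step.reward)
    return seq


def next_token_pairs(seq: List[int]) -> Tuple[List[int], List[int]]:
    """Shift the sequence by one position: the model sees inputs[i]
    (and everything before it) and is trained to predict targets[i]."""
    return seq[:-1], seq[1:]


trajectory = [Step([3, 7], action=1, reward=0), Step([5, 2], action=2, reward=2)]
sequence = flatten_trajectory(trajectory)
inputs, targets = next_token_pairs(sequence)
print(sequence)
print(inputs)
print(targets)
```

Under this assumed layout, an agent would interact with the model by appending its chosen action token to the sequence and letting the model generate the tokens that follow, which is the sense in which next-token prediction yields an interactive rollout.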

ivideogpt, prediction, world model

Country:
- Asia
  - Japan > Honshū
    - Kantō > Tokyo Metropolis Prefecture
      - Tokyo (0.04)
    - Chūbu > Ishikawa Prefecture
      - Kanazawa (0.04)
  - China > Tianjin Province
    - Tianjin (0.04)

Genre:
- Research Report > Promising Solution (0.34)

Industry:
- Education (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Robots (1.00)
  - Representation & Reasoning (1.00)
  - Cognitive Science > Problem Solving (1.00)
  - Natural Language > Large Language Model (0.94)
  - Machine Learning
    - Reinforcement Learning (1.00)
    - Neural Networks > Deep Learning (1.00)
    - Learning Graphical Models > Undirected Networks
      - Markov Models (0.46)
