iVideoGPT: Interactive VideoGPTs are Scalable World Models
Wu, Jialong, Yin, Shaofeng, Feng, Ningya, He, Xu, Li, Dong, Hao, Jianye, Long, Mingsheng
arXiv.org Artificial Intelligence
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is adaptable to serve as interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications.
Jun-2-2024
- Country:
  - Asia
    - China > Tianjin Province
      - Tianjin (0.04)
    - Japan > Honshū
      - Chūbu > Ishikawa Prefecture
        - Kanazawa (0.04)
      - Kantō > Tokyo Metropolis Prefecture
        - Tokyo (0.04)
- Genre:
  - Research Report > Promising Solution (0.34)
- Industry:
  - Education (0.46)
- Technology:
  - Information Technology > Artificial Intelligence
    - Cognitive Science > Problem Solving (1.00)
    - Machine Learning
      - Learning Graphical Models > Undirected Networks
        - Markov Models (0.46)
      - Neural Networks > Deep Learning (1.00)
      - Reinforcement Learning (1.00)
    - Natural Language > Large Language Model (0.94)
    - Representation & Reasoning (1.00)
    - Robots (1.00)