From Masks to Worlds: A Hitchhiker's Guide to World Models
Jinbin Bai, Yu Lei, Hecong Wu, Yuchen Zhu, Shufan Li, Yi Xin, Xiangtai Li, Molei Tao, Aditya Grover, Ming-Hsuan Yang
arXiv.org Artificial Intelligence
This is not a typical survey of world models; it is a guide for those who want to build worlds. We do not aim to catalog every paper that has ever mentioned a "world model". Instead, we follow one clear road: from early masked models that unified representation learning across modalities, to unified architectures that share a single paradigm, then to interactive generative models that close the action-perception loop, and finally to memory-augmented systems that sustain consistent worlds over time. We bypass loosely related branches to focus on the core: the generative heart, the interactive loop, and the memory system. We argue that this is the most promising path towards true world models.

The term "world model" has been used to describe many different ideas: learned environment simulators for reinforcement learning (Ha & Schmidhuber, 2018; Hafner et al., 2019), agents that integrate learned models with planning (Schrittwieser et al., 2020), and large language models that simulate entire societies (Park et al., 2023). Yet despite hundreds of related works, there is no clear consensus on how to actually build a true world model. In this paper, we take a stance: the path is much narrower than it appears. A true world model is not a monolithic entity but a system synthesized from three core subsystems: a generative heart that produces world states, an interactive loop that closes the action-perception cycle in real time, and a persistent memory system that sustains coherence over long horizons. The history of the field can be understood as an evolutionary journey from first mastering these components in isolation to now integrating them. Most works focus on optimizing narrow tasks and so drift away from the generative, interactive, and persistent nature required for a true world model. To make this perspective concrete, we chart the historical evolution of world models as a sequence of five stages, shown in Figure 1.
It begins with Stage I: Mask-based Models, which established a universal, token-based pretraining paradigm across modalities.
Oct-24-2025