Temporal Triplane Transformers as Occupancy World Models

Haoran Xu, Peixi Peng, Guang Tan, Yiqian Chang, Yisen Zhao, Yonghong Tian

arXiv.org, Artificial Intelligence

World models [1, 2] are designed to predict future scenes and facilitate motion planning for agents. These models first construct low-dimensional representations of scenes, which serve as a foundation for learning the patterns of environmental dynamics. This capability supports identifying potential hazards and inferring the intentions of traffic participants, ultimately leading to better decision-making. This paper focuses on world models for autonomous driving [3, 4, 5, 6, 7], where accurately predicting the future behavior of traffic participants is essential for the agent's planning. Existing methods [8, 6, 7, 9] either provide instance-level predictions of traffic participants from a Bird's Eye View (BEV) perspective, or directly employ diffusion models [10, 11, 12, 13, 14] to generate future pixel-level driving views. However, these methods have difficulty establishing fine-grained 3D associations between scene changes and the agent's motion planning. Recent advances in 3D occupancy technologies [15, 16, 17, 18, 19] have attracted significant attention from both academia and industry [20, 21].
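To make the world-model pipeline described above concrete, the sketch below shows a generic occupancy world model in PyTorch: it encodes a 3D occupancy grid into a low-dimensional latent, rolls the latent forward with a recurrent dynamics module, and decodes predicted future occupancy at each step. This is only a minimal illustration of the general paradigm, not the paper's temporal triplane transformer; every module choice, name, shape, and hyperparameter here is an assumption.

```python
# Illustrative sketch of a generic occupancy world model (assumed design,
# not the paper's architecture): encode -> latent rollout -> decode.
import torch
import torch.nn as nn


class OccupancyWorldModel(nn.Module):
    def __init__(self, grid=(8, 64, 64), latent_dim=256, horizon=4):
        super().__init__()
        self.grid = grid          # (D, H, W) voxel grid, assumed size
        self.horizon = horizon    # number of future frames to predict
        d, h, w = grid
        # Encoder: compress the voxel grid into a compact latent vector,
        # the "lower-dimensional representation" of the scene.
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(16, latent_dim),
        )
        # Dynamics: model environmental dynamics in latent space.
        self.dynamics = nn.GRUCell(latent_dim, latent_dim)
        # Decoder: map a latent back to per-voxel occupancy logits.
        self.decoder = nn.Linear(latent_dim, d * h * w)

    def forward(self, occ):
        # occ: (B, 1, D, H, W) current occupancy grid in [0, 1].
        z = self.encoder(occ)
        hidden = torch.zeros_like(z)
        preds = []
        for _ in range(self.horizon):
            hidden = self.dynamics(z, hidden)
            logits = self.decoder(hidden).view(-1, 1, *self.grid)
            preds.append(logits)
            z = hidden  # autoregressive rollout in latent space
        return torch.stack(preds, dim=1)  # (B, horizon, 1, D, H, W)


model = OccupancyWorldModel()
occ = (torch.rand(2, 1, 8, 64, 64) > 0.5).float()
future_logits = model(occ)
print(future_logits.shape)  # torch.Size([2, 4, 1, 8, 64, 64])
```

In this kind of design, predicting in a compact latent space keeps the rollout cheap, while decoding to a full occupancy grid preserves the fine-grained 3D structure that BEV instance-level or pixel-level predictions lack.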