Temporal Triplane Transformers as Occupancy World Models

Haoran Xu, Peixi Peng, Guang Tan, Yiqian Chang, Yisen Zhao, Yonghong Tian

arXiv.org, Artificial Intelligence

World models [1, 2] are designed to predict future scenes and facilitate motion planning for agents. These models first construct low-dimensional representations of scenes, which serve as a foundation for learning the patterns of environmental dynamics. This capability supports identifying potential hazards and inferring the intentions of traffic participants, ultimately leading to better decision-making. This paper focuses on world models for autonomous driving [3, 4, 5, 6, 7], where accurately predicting the future behavior of traffic participants is essential for the agent's planning. Existing methods [8, 6, 7, 9] either provide instance-level predictions of traffic participants from a Bird's Eye View (BEV) perspective, or directly employ diffusion models [10, 11, 12, 13, 14] to generate future pixel-level driving views. However, these methods have difficulty establishing fine-grained 3D associations between scene changes and the agent's motion planning. Recent advances in 3D occupancy technologies [15, 16, 17, 18, 19] have attracted significant attention from both academia and industry [20, 21].
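To make the world-model pipeline described above concrete, the sketch below shows a generic occupancy world model in PyTorch: it encodes a 3D occupancy grid into a low-dimensional latent, rolls the latent forward with a recurrent dynamics module, and decodes predicted future occupancy at each step. This is only a minimal illustration of the general paradigm, not the paper's temporal triplane transformer; every module choice, name, shape, and hyperparameter here is an assumption.

```python
# Illustrative sketch of a generic occupancy world model (assumed design,
# not the paper's architecture): encode -> latent rollout -> decode.
import torch
import torch.nn as nn


class OccupancyWorldModel(nn.Module):
    def __init__(self, grid=(8, 64, 64), latent_dim=256, horizon=4):
        super().__init__()
        self.grid = grid          # (D, H, W) voxel grid, assumed size
        self.horizon = horizon    # number of future frames to predict
        d, h, w = grid
        # Encoder: compress the voxel grid into a compact latent vector,
        # the "lower-dimensional representation" of the scene.
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(16, latent_dim),
        )
        # Dynamics: model environmental dynamics in latent space.
        self.dynamics = nn.GRUCell(latent_dim, latent_dim)
        # Decoder: map a latent back to per-voxel occupancy logits.
        self.decoder = nn.Linear(latent_dim, d * h * w)

    def forward(self, occ):
        # occ: (B, 1, D, H, W) current occupancy grid in [0, 1].
        z = self.encoder(occ)
        hidden = torch.zeros_like(z)
        preds = []
        for _ in range(self.horizon):
            hidden = self.dynamics(z, hidden)
            logits = self.decoder(hidden).view(-1, 1, *self.grid)
            preds.append(logits)
            z = hidden  # autoregressive rollout in latent space
        return torch.stack(preds, dim=1)  # (B, horizon, 1, D, H, W)


model = OccupancyWorldModel()
occ = (torch.rand(2, 1, 8, 64, 64) > 0.5).float()
future_logits = model(occ)
print(future_logits.shape)  # torch.Size([2, 4, 1, 8, 64, 64])
```

In this kind of design, predicting in a compact latent space keeps the rollout cheap, while decoding to a full occupancy grid preserves the fine-grained 3D structure that BEV instance-level or pixel-level predictions lack.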