LAP: Fast LAtent Diffusion Planner with Fine-Grained Feature Distillation for Autonomous Driving
Jinhao Zhang, Wenlong Xia, Zhexuan Zhou, Youmin Gong, Jie Mei
–arXiv.org Artificial Intelligence
Diffusion models have demonstrated strong capabilities for modeling human-like driving behaviors in autonomous driving, but their iterative sampling process induces substantial latency, and operating directly on raw trajectory points forces the model to spend capacity on low-level kinematics rather than high-level multi-modal semantics. To address these limitations, we propose LAtent Planner (LAP), a framework that plans in a VAE-learned latent space that disentangles high-level intents from low-level kinematics, enabling our planner to capture rich, multi-modal driving strategies. We further introduce a fine-grained feature distillation mechanism to guide better interaction and fusion between the high-level semantic planning space and the vectorized scene context. Notably, LAP can produce high-quality plans in a single denoising step, substantially reducing computational overhead. Through extensive evaluations on the large-scale nuPlan benchmark, LAP achieves state-of-the-art closed-loop performance among learning-based planning methods, while demonstrating an inference speedup of up to 10x over previous SOTA approaches.

A central challenge is handling the inherent uncertainty and behavioral multi-modality of real-world traffic, where multiple distinct yet equally plausible maneuvers may be available (Yang et al., 2023; Xiao et al., 2020). While early rule-based systems offered interpretability, their hand-crafted logic is brittle and fails to scale to the long tail of open-world scenarios (Fan et al., 2018; Chen et al., 2024). Consequently, the field has shifted towards data-driven Imitation Learning (IL), which excels at capturing nuanced, human-like behaviors from large-scale datasets (Le Mero et al., 2022; Teng et al., 2022).
However, the standard IL objective is notoriously susceptible to mode averaging, where the model collapses multiple valid expert trajectories into a single, physically infeasible path, fundamentally failing to represent the multi-modal nature of human decision-making (Strohbeck et al., 2020). To overcome this limitation, Denoising Diffusion Probabilistic Models (DDPMs) have emerged as a powerful tool for modeling complex, multi-modal distributions (Liao et al., 2025; Ho et al., 2020). However, existing approaches that apply diffusion directly to raw trajectory waypoints are both computationally inefficient and conceptually flawed. This mirrors the core challenge of early image synthesis: operating in a high-dimensional pixel space expends vast model capacity on low-level details over high-level semantics (Rombach et al., 2022).
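The core idea (denoise in a compact learned latent space and decode to a trajectory, rather than diffusing over raw waypoints) can be sketched as follows. This is a minimal illustrative toy, not the paper's code: the dimensions, the linear stand-ins for the denoiser and the VAE decoder, and the scene-feature conditioning are all assumptions; a real system would use trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8          # size of the VAE latent (assumed for illustration)
HORIZON, STATE = 16, 2  # 16 future waypoints, (x, y) each (assumed)
SCENE_DIM = 4           # size of the vectorized scene feature (assumed)

# Stand-in for the learned denoiser: maps (noisy latent, scene features) to a
# prediction of the clean latent z_0 in one step (x0-parameterization), which
# is what enables single-step sampling.
W_denoise = rng.normal(scale=0.1, size=(LATENT_DIM + SCENE_DIM, LATENT_DIM))

# Stand-in for the VAE decoder: clean latent -> trajectory waypoints.
W_decode = rng.normal(scale=0.1, size=(LATENT_DIM, HORIZON * STATE))

def plan_one_step(scene_feat: np.ndarray) -> np.ndarray:
    """Sample a plan with a single denoising step in latent space."""
    z_T = rng.normal(size=LATENT_DIM)          # start from pure Gaussian noise
    inp = np.concatenate([z_T, scene_feat])    # condition on scene context
    z_0 = inp @ W_denoise                      # one-step prediction of z_0
    return (z_0 @ W_decode).reshape(HORIZON, STATE)  # decode to waypoints

traj = plan_one_step(scene_feat=rng.normal(size=SCENE_DIM))
print(traj.shape)  # (16, 2)
```

Because sampling is a single forward pass instead of an iterative reverse-diffusion chain, and the latent is far smaller than the raw waypoint sequence, this shape of pipeline is where the latency reduction comes from.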
Dec-3-2025