LAP: Fast LAtent Diffusion Planner with Fine-Grained Feature Distillation for Autonomous Driving
Jinhao Zhang, Wenlong Xia, Zhexuan Zhou, Youmin Gong, Jie Mei
–arXiv.org Artificial Intelligence
Diffusion models have demonstrated strong capabilities for modeling human-like driving behaviors in autonomous driving, but their iterative sampling process induces substantial latency, and operating directly on raw trajectory points forces the model to spend capacity on low-level kinematics rather than high-level multi-modal semantics. To address these limitations, we propose LAtent Planner (LAP), a framework that plans in a VAE-learned latent space that disentangles high-level intents from low-level kinematics, enabling our planner to capture rich, multi-modal driving strategies. We further introduce a fine-grained feature distillation mechanism to guide better interaction and fusion between the high-level semantic planning space and the vectorized scene context. Notably, LAP can produce high-quality plans in a single denoising step, substantially reducing computational overhead. Through extensive evaluations on the large-scale nuPlan benchmark, LAP achieves state-of-the-art closed-loop performance among learning-based planning methods, while demonstrating an inference speedup of up to 10x over previous SOTA approaches.

A central challenge is handling the inherent uncertainty and behavioral multi-modality of real-world traffic, where multiple distinct yet equally plausible maneuvers may be available (Yang et al., 2023; Xiao et al., 2020). While early rule-based systems offered interpretability, their hand-crafted logic is brittle and fails to scale to the long tail of open-world scenarios (Fan et al., 2018; Chen et al., 2024). Consequently, the field has shifted towards data-driven Imitation Learning (IL), which excels at capturing nuanced, human-like behaviors from large-scale datasets (Le Mero et al., 2022; Teng et al., 2022).
However, the standard IL objective is notoriously susceptible to mode averaging, where the model collapses multiple valid expert trajectories into a single, physically infeasible path, fundamentally failing to represent the multi-modal nature of human decision-making (Strohbeck et al., 2020). To overcome this limitation, Denoising Diffusion Probabilistic Models (DDPMs) have emerged as a powerful tool for modeling complex, multi-modal distributions (Liao et al., 2025; Ho et al., 2020). However, existing approaches that apply diffusion directly to raw trajectory waypoints are both computationally inefficient and conceptually flawed. This mirrors the core challenge of early image synthesis: operating in a high-dimensional pixel space expends vast model capacity on low-level details over high-level semantics (Rombach et al., 2022).
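The core idea (denoise in a compact learned latent space and decode to a trajectory, rather than diffusing over raw waypoints) can be sketched as follows. This is a minimal illustrative toy, not the paper's code: the dimensions, the linear stand-ins for the denoiser and the VAE decoder, and the scene-feature conditioning are all assumptions; a real system would use trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8          # size of the VAE latent (assumed for illustration)
HORIZON, STATE = 16, 2  # 16 future waypoints, (x, y) each (assumed)
SCENE_DIM = 4           # size of the vectorized scene feature (assumed)

# Stand-in for the learned denoiser: maps (noisy latent, scene features) to a
# prediction of the clean latent z_0 in one step (x0-parameterization), which
# is what enables single-step sampling.
W_denoise = rng.normal(scale=0.1, size=(LATENT_DIM + SCENE_DIM, LATENT_DIM))

# Stand-in for the VAE decoder: clean latent -> trajectory waypoints.
W_decode = rng.normal(scale=0.1, size=(LATENT_DIM, HORIZON * STATE))

def plan_one_step(scene_feat: np.ndarray) -> np.ndarray:
    """Sample a plan with a single denoising step in latent space."""
    z_T = rng.normal(size=LATENT_DIM)          # start from pure Gaussian noise
    inp = np.concatenate([z_T, scene_feat])    # condition on scene context
    z_0 = inp @ W_denoise                      # one-step prediction of z_0
    return (z_0 @ W_decode).reshape(HORIZON, STATE)  # decode to waypoints

traj = plan_one_step(scene_feat=rng.normal(size=SCENE_DIM))
print(traj.shape)  # (16, 2)
```

Because sampling is a single forward pass instead of an iterative reverse-diffusion chain, and the latent is far smaller than the raw waypoint sequence, this shape of pipeline is where the latency reduction comes from.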
Dec-3-2025