ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving
Liu, Xueyi, Zhong, Zuodong, Guo, Yuxin, Liu, Yun-Fu, Su, Zhiguo, Zhang, Qichao, Wang, Junli, Gao, Yinfeng, Zheng, Yupeng, Lin, Qiao, Chen, Huiyong, Zhao, Dongbin
arXiv.org Artificial Intelligence
End-to-end (E2E) autonomous driving offers a scalable, data-driven paradigm that has recently garnered increasing attention [1, 2, 3]. Despite its advantages in simplifying the driving pipeline, most existing E2E approaches rely on imitation learning [4, 5] and exhibit limitations in complex, closed-loop environments. Specifically, they often suffer from causal confusion in interactive cases [6] and struggle to generalize to out-of-distribution scenarios [7]. Recent progress in multimodal large language models (MLLMs) [8, 9, 10] has enabled vision-language reasoning [11] and zero-shot generalization [12], offering new opportunities for E2E autonomous driving. Recent efforts have explored dual-system frameworks [13, 14, 15], LLM distillation for enhancing E2E driving [16, 17], and direct trajectory prediction in textual form [18, 19, 20]. While promising, these approaches predominantly operate in open-loop settings or exhibit suboptimal performance in closed-loop evaluations. This limitation stems from their inability to perform context-aware reasoning and robust planning in closed-loop scenarios, where continuous adaptation to dynamic environments is essential [21]. We identify three key challenges that limit the full exploitation of MLLMs'
Sep-23-2025