Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving

Sheng Yang, Tong Zhan, Guancheng Chen, Yanfeng Lu, Jian Wang

arXiv.org Artificial Intelligence 

In this work, we reconceptualize autonomous driving as a generalized language and formulate the trajectory planning task as next-waypoint prediction. Our framework presents a single-pass generation paradigm that aligns with the inherent sequentiality of driving. This approach leverages the generative capacity of a Vision-Language Model (VLM) to enable end-to-end trajectory prediction directly from front-view camera input. The efficacy of this method is underpinned by a principled supervision strategy derived from statistical modeling, which provides a well-defined learning objective and makes the framework well suited to mastering complex driving policies through imitation learning from large-scale expert demonstrations. Empirically, our method achieves state-of-the-art performance on the nuScenes dataset, delivering an overall improvement of more than 30% over prior baselines. Furthermore, it exhibits superior generalization on cross-domain datasets acquired from diverse vehicles, demonstrating notable potential for cross-vehicle robustness and adaptability. Building on these empirical strengths, this work introduces a model that enables fundamental driving behaviors, laying the foundation for more capable self-driving agents. Code will be available upon publication.

Figure 1: Visualization of typical driving scenarios. Predicted trajectories and ego-vehicle coverage are shown in green, whereas ground-truth trajectories are displayed in orange.

Human driving is an inherently sequential decision-making process in which each action is conditioned on a real-time understanding of the surrounding scene. This dynamic interplay of perception and action closely resembles natural language generation, which likewise produces a highly correlated sequence of outputs. Viewing the driving task from this perspective allows us to frame a VLM as a powerful policy network.
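To make the next-waypoint-prediction framing concrete, the following is a minimal sketch of how continuous waypoints could be discretized into tokens that a VLM emits autoregressively in a single pass. The bin ranges, bin count, and function names are illustrative assumptions, not the paper's actual tokenization scheme or model interface.

```python
import numpy as np

# Assumed BEV coordinate ranges (meters) and vocabulary size; the paper's
# actual discretization is not specified here.
X_RANGE, Y_RANGE, N_BINS = (-5.0, 75.0), (-20.0, 20.0), 256

def waypoint_to_tokens(x: float, y: float) -> tuple[int, int]:
    """Quantize one (x, y) waypoint into a pair of discrete token ids."""
    wx = (X_RANGE[1] - X_RANGE[0]) / N_BINS
    wy = (Y_RANGE[1] - Y_RANGE[0]) / N_BINS
    xi = int(np.clip((x - X_RANGE[0]) // wx, 0, N_BINS - 1))
    yi = int(np.clip((y - Y_RANGE[0]) // wy, 0, N_BINS - 1))
    return xi, yi

def tokens_to_waypoint(xi: int, yi: int) -> tuple[float, float]:
    """Map token ids back to bin-center coordinates (inverse of the quantizer)."""
    wx = (X_RANGE[1] - X_RANGE[0]) / N_BINS
    wy = (Y_RANGE[1] - Y_RANGE[0]) / N_BINS
    return X_RANGE[0] + (xi + 0.5) * wx, Y_RANGE[0] + (yi + 0.5) * wy

if __name__ == "__main__":
    # An expert trajectory becomes a flat token sequence; supervising a VLM
    # with next-token prediction on such sequences is one way to realize the
    # imitation-learning objective described in the abstract.
    trajectory = [(2.1, 0.0), (4.3, 0.1), (6.8, 0.4)]
    tokens = [t for wp in trajectory for t in waypoint_to_tokens(*wp)]
    decoded = [tokens_to_waypoint(tokens[i], tokens[i + 1])
               for i in range(0, len(tokens), 2)]
    print(tokens)
    print([(round(x, 2), round(y, 2)) for x, y in decoded])
```

Under this framing, the generated waypoint tokens are decoded back into metric coordinates to produce the planned trajectory shown against the ground truth in Figure 1.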
