DriveMM: All-in-One Large Multimodal Model for Autonomous Driving
Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, Lin Ma
arXiv.org Artificial Intelligence
Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite these advances, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting overall capability and the ability to generalize. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess its general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM will serve as a promising solution for future end-to-end autonomous driving applications in the real world. Project page with code: https://github.com/zhijian11/DriveMM.
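To make the staged recipe in the abstract more concrete, the sketch below shows one possible way to organize the dataset standardization and the two-stage curriculum (general visual pre-training followed by mixed AD-task fine-tuning). It is a minimal, hypothetical illustration: the class names, field names, and dataset identifiers are assumptions for illustration only, not DriveMM's actual code or schema (see the project repository for the real implementation).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DriveSample:
    """A unified training record: visual inputs, a prompt, and a target answer.

    All field names here are hypothetical; the abstract does not specify
    the actual schema used by DriveMM.
    """
    visual_paths: List[str]   # one path (single image) or several (multi-view video frames)
    modality: str             # e.g. "image" or "multiview_video"
    question: str             # AD-related query (perception / prediction / planning)
    answer: str               # ground-truth response used for supervised fine-tuning


def standardize(raw: Dict, source: str) -> DriveSample:
    """Map a dataset-specific record onto the unified schema (keys are assumed)."""
    if source == "single_image_qa":
        return DriveSample([raw["image"]], "image", raw["q"], raw["a"])
    if source == "multiview_video_qa":
        return DriveSample(raw["frames"], "multiview_video", raw["q"], raw["a"])
    raise ValueError(f"unknown source dataset: {source}")


# Two-stage curriculum mirroring the abstract: pre-train on general visual data,
# then fine-tune on the standardized AD datasets. Dataset names are placeholders.
CURRICULUM = [
    ("pretrain", ["general_image_caption", "general_video_qa"]),
    ("finetune", ["single_image_qa", "multiview_video_qa"]),
]


def run_curriculum(train_one_stage) -> None:
    """Run the stages in order; `train_one_stage` is the caller's training loop."""
    for stage_name, dataset_names in CURRICULUM:
        train_one_stage(stage_name, dataset_names)
```

The point of the sketch is the separation of concerns the abstract describes: heterogeneous inputs (single images, multi-view videos) are first normalized into one record format, so that a single model and training loop can be reused across both curriculum stages and across all perception, prediction, and planning tasks.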
Dec-13-2024