Goto

Collaborating Authors

 hd map




BEV-VLM: Trajectory Planning via Unified BEV Abstraction

Chen, Guancheng, Yang, Sheng, Zhan, Tong, Wang, Jian

arXiv.org Artificial Intelligence

ABSTRACT This paper introduces BEV -VLM, a novel framework for trajectory planning in autonomous driving that leverages Vision-Language Models (VLMs) with Bird's-Eye View (BEV) feature maps as visual inputs. Unlike conventional approaches that rely solely on raw visual data such as camera images, our method utilizes highly compressed and informative BEV representations, which are generated by fusing multi-modal sensor data (e.g., camera and LiDAR) and aligning them with HD Maps. This unified BEV -HD Map format provides a geometrically consistent and rich scene description, enabling VLMs to perform accurate trajectory planning. Experimental results on the nuScenes dataset demonstrate 44.8% improvements in planning accuracy and complete collision avoidance. Our work highlights that VLMs can effectively interpret processed visual representations like BEV features, expanding their applicability beyond raw images in trajectory planning. Index T erms-- Autonomous Driving, Vision-Language Model, Multi-Modal Learning 1. INTRODUCTION In recent years, the pursuit of advanced autonomous driving (AD) has attracted extensive attention, with Vision-Language Models (VLMs) emerging as a promising pathway, owing to their inherent cognitive capabilities from pre-training that enable effective application in real-world scenarios. While existing research has demonstrated the feasibility and reliability of using VLMs for path planning by feeding visual camera images, these approaches suffer from two key limitations: they rely solely on camera data and thus lack integration with other modalities, such as LiDAR point clouds, and they fail to explore VLMs' potential for planning based on Bird's-Eye View (BEV) features. To address these gaps, this work avoids the direct use of raw visual signals (e.g., camera images) as VLM inputs.


Online Mapping for Autonomous Driving: Addressing Sensor Generalization and Dynamic Map Updates in Campus Environments

Zhang, Zihan, Ravichandran, Abhijit, Korti, Pragnya, Wang, Luobin, Christensen, Henrik I.

arXiv.org Artificial Intelligence

High-definition (HD) maps are essential for autonomous driving, providing precise information such as road boundaries, lane dividers, and crosswalks to enable safe and accurate navigation. However, traditional HD map generation is labor-intensive, expensive, and difficult to maintain in dynamic environments. To overcome these challenges, we present a real-world deployment of an online mapping system on a campus golf cart platform equipped with dual front cameras and a LiDAR sensor. Our work tackles three core challenges: (1) labeling a 3D HD map for campus environment; (2) integrating and generalizing the SemVecMap model onboard; and (3) incrementally generating and updating the predicted HD map to capture environmental changes. By fine-tuning with campus-specific data, our pipeline produces accurate map predictions and supports continual updates, demonstrating its practical value in real-world autonomous driving scenarios.


U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration

Li, Xiaofan, Xu, Zhihao, Wu, Chenming, Yang, Zhao, Zhang, Yumeng, Liu, Jiang-Jiang, Yu, Haibao, Duan, Fan, Ye, Xiaoqing, Wang, Yuan, Li, Shirui, Sun, Xun, Wan, Ji, Wang, Jun

arXiv.org Artificial Intelligence

Accurate localization using visual information is a critical yet challenging task, especially in urban environments where nearby buildings and construction sites significantly degrade GNSS (Global Navigation Satellite System) signal quality. This issue underscores the importance of visual localization techniques in scenarios where GNSS signals are unreliable. This paper proposes U-ViLAR, a novel uncertainty-aware visual localization framework designed to address these challenges while enabling adaptive localization using high-definition (HD) maps or navigation maps. Specifically, our method first extracts features from the input visual data and maps them into Bird's-Eye-View (BEV) space to enhance spatial consistency with the map input. Subsequently, we introduce: a) Perceptual Uncertainty-guided Association, which mitigates errors caused by perception uncertainty, and b) Localization Uncertainty-guided Registration, which reduces errors introduced by localization uncertainty. By effectively balancing the coarse-grained large-scale localization capability of association with the fine-grained precise localization capability of registration, our approach achieves robust and accurate localization. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple localization tasks. Furthermore, our model has undergone rigorous testing on large-scale autonomous driving fleets and has demonstrated stable performance in various challenging urban scenarios.


Beyond Features: How Dataset Design Influences Multi-Agent Trajectory Prediction Performance

Demmler, Tobias, Häringer, Jakob, Tamke, Andreas, Dang, Thao, Hegai, Alexander, Mikelsons, Lars

arXiv.org Artificial Intelligence

Accurate trajectory prediction is critical for safe autonomous navigation, yet the impact of dataset design on model performance remains understudied. This work systematically examines how feature selection, cross-dataset transfer, and geographic diversity influence trajectory prediction accuracy in multi-agent settings. We evaluate a state-of-the-art model using our novel L4 Motion Forecasting dataset based on our own data recordings in Germany and the US. This includes enhanced map and agent features. We compare our dataset to the US-centric Argoverse 2 benchmark. First, we find that incorporating supplementary map and agent features unique to our dataset, yields no measurable improvement over baseline features, demonstrating that modern architectures do not need extensive feature sets for optimal performance. The limited features of public datasets are sufficient to capture convoluted interactions without added complexity. Second, we perform cross-dataset experiments to evaluate how effective domain knowledge can be transferred between datasets. Third, we group our dataset by country and check the knowledge transfer between different driving cultures.


Using Language and Road Manuals to Inform Map Reconstruction for Autonomous Driving

Tumu, Akshar, Christensen, Henrik I., Vazquez-Chanlatte, Marcell, Tsuchiya, Chikao, Bhanderi, Dhaval

arXiv.org Artificial Intelligence

Lane-topology prediction is a critical component of safe and reliable autonomous navigation. An accurate understanding of the road environment aids this task. We observe that this information often follows conventions encoded in natural language, through design codes that reflect the road structure and road names that capture the road functionality. We augment this information in a lightweight manner to SMERF, a map-prior-based online lane-topology prediction model, by combining structured road metadata from OSM maps and lane-width priors from Road design manuals with the road centerline encodings. We evaluate our method on two geo-diverse complex intersection scenarios. Our method shows improvement in both lane and traffic element detection and their association. We report results using four topology-aware metrics to comprehensively assess the model performance. These results demonstrate the ability of our approach to generalize and scale to diverse topologies and conditions.


Inferring Driving Maps by Deep Learning-based Trail Map Extraction

Hubbertz, Michael, Colling, Pascal, Han, Qi, Meisen, Tobias

arXiv.org Artificial Intelligence

High-definition (HD) maps offer extensive and accurate environmental information about the driving scene, making them a crucial and essential element for planning within autonomous driving systems. T o avoid extensive efforts from manual labeling, methods for automating the map creation have emerged. Recent trends have moved from offline mapping to online mapping, ensuring availability and actuality of the utilized maps. While the performance has increased in recent years, online mapping still faces challenges regarding temporal consistency, sensor occlusion, runtime, and generalization. W e propose a novel offline mapping approach that integrates trails -- informal routes used by drivers -- into the map creation process. Our method aggregates trail data from the ego vehicle and other traffic participants to construct a comprehensive global map using transformer-based deep learning models. Unlike traditional offline mapping, our approach enables continuous updates while remaining sensor-agnostic, facilitating efficient data transfer . Our method demonstrates superior performance compared to state-of-the-art online mapping approaches, achieving improved generalization to previously unseen environments and sensor configurations.


SIO-Mapper: A Framework for Lane-Level HD Map Construction Using Satellite Images and OpenStreetMap with No On-Site Visits

Cho, Younghun, Ryu, Jee-Hwan

arXiv.org Artificial Intelligence

High-definition (HD) maps, particularly those containing lane-level information regarded as ground truth, are crucial for vehicle localization research. Traditionally, constructing HD maps requires highly accurate sensor measurements collection from the target area, followed by manual annotation to assign semantic information. Consequently, HD maps are limited in terms of geographic coverage. To tackle this problem, in this paper, we propose SIO-Mapper, a novel lane-level HD map construction framework that constructs city-scale maps without physical site visits by utilizing satellite images and OpenStreetmap data. One of the key contributions of SIO-Mapper is its ability to extract lane information more accurately by introducing SIO-Net, a novel deep learning network that integrates features from satellite image and OpenStreetmap using both Transformer-based and convolution-based encoders. Furthermore, to overcome challenges in merging lanes over large areas, we introduce a novel lane integration methodology that combines cluster-based and graph-based approaches. This algorithm ensures the seamless aggregation of lane segments with high accuracy and coverage, even in complex road environments. We validated SIO-Mapper on the Naver Labs Open Dataset and NuScenes dataset, demonstrating better performance in various environments including Korea, the United States, and Singapore compared to the state-of-the-art lane-level HD mapconstruction methods.


SMART: Advancing Scalable Map Priors for Driving Topology Reasoning

Ye, Junjie, Paz, David, Zhang, Hengyuan, Guo, Yuliang, Huang, Xinyu, Christensen, Henrik I., Wang, Yue, Ren, Liu

arXiv.org Artificial Intelligence

Topology reasoning is crucial for autonomous driving as it enables comprehensive understanding of connectivity and relationships between lanes and traffic elements. While recent approaches have shown success in perceiving driving topology using vehicle-mounted sensors, their scalability is hindered by the reliance on training data captured by consistent sensor configurations. We identify that the key factor in scalable lane perception and topology reasoning is the elimination of this sensor-dependent feature. To address this, we propose SMART, a scalable solution that leverages easily available standard-definition (SD) and satellite maps to learn a map prior model, supervised by large-scale geo-referenced high-definition (HD) maps independent of sensor settings. Attributed to scaled training, SMART alone achieves superior offline lane topology understanding using only SD and satellite inputs. Extensive experiments further demonstrate that SMART can be seamlessly integrated into any online topology reasoning methods, yielding significant improvements of up to 28% on the OpenLane-V2 benchmark.