Fully Sparse 3D Object Detection
As the perception range of LiDAR sensors increases, LiDAR-based 3D object detection becomes a dominant component of long-range perception for autonomous driving. Mainstream 3D object detectors usually build dense feature maps in the network backbone and prediction head. However, the computational and spatial costs of these dense feature maps grow quadratically with the perception range, which makes it hard for such detectors to scale to long-range settings. To enable efficient long-range LiDAR-based object detection, we build a fully sparse 3D object detector (FSD). The computational and spatial cost of FSD is roughly linear in the number of points and independent of the perception range.
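The scaling argument above can be made concrete with a minimal sketch (not from the paper; function names and voxel size are illustrative assumptions): a dense bird's-eye-view grid covering a square region has a cell count quadratic in the perception range, while a fully sparse detector touches only the occupied points.

```python
# Illustrative sketch: dense BEV cost grows quadratically with range,
# while fully sparse processing scales with the number of points.

def dense_bev_cells(perception_range_m: float, voxel_size_m: float) -> int:
    """Cells in a square dense BEV grid covering [-range, range]^2."""
    side = int(2 * perception_range_m / voxel_size_m)
    return side * side

def sparse_cost(num_points: int) -> int:
    """A fully sparse detector processes only occupied locations,
    so its cost is proportional to the point count."""
    return num_points

# Doubling the perception range quadruples the dense grid,
# but leaves the sparse cost (fixed point budget) unchanged.
assert dense_bev_cells(150.0, 0.5) == 4 * dense_bev_cells(75.0, 0.5)
assert sparse_cost(120_000) == sparse_cost(120_000)
```

This is why dense pipelines become the bottleneck at long range even when the point cloud itself stays roughly the same size.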
Data-Efficient Point Cloud Semantic Segmentation Pipeline for Unimproved Roads
Yarovoi, Andrew, Valenta, Christopher R.
In this case study, we present a data-efficient point cloud segmentation pipeline and training framework for robust segmentation of unimproved roads and seven other classes. Our method employs a two-stage training framework: first, a projection-based convolutional neural network is pre-trained on a mixture of public urban datasets and a small, curated in-domain dataset; then, a lightweight prediction head is fine-tuned exclusively on in-domain data. Along the way, we explore the application of Point Prompt Training to batch normalization layers and the effects of Manifold Mixup as a regularizer within our pipeline. We also explore the effects of incorporating histogram-normalized ambients to further boost performance. Using only 50 labeled point clouds from our target domain, we show that our proposed training approach improves mean Intersection-over-Union from 33.5% to 51.8% and overall accuracy from 85.5% to 90.8% when compared to naive training on the in-domain data. Crucially, our results demonstrate that pre-training across multiple datasets is key to improving generalization and enabling robust segmentation under limited in-domain supervision. Overall, this study demonstrates a practical framework for robust 3D semantic segmentation in challenging, low-data scenarios.

Semantic segmentation of 3D point clouds is a foundational task for scene understanding, enabling a range of downstream applications such as autonomous route planning and infrastructure inspection. Despite significant progress in this field, most state-of-the-art segmentation models rely heavily on the availability of large, labeled training datasets. However, generating labeled point cloud data remains a substantial bottleneck: manual annotation is both labor-intensive and time-consuming, requiring over 30 minutes per scan on average in our experiments.
This challenge makes it impractical to recreate large-scale datasets, commonly containing over 25,000 scans, for new or underrepresented environments.
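The two metrics reported above, mean Intersection-over-Union (mIoU) and overall accuracy, are standard and can both be derived from a per-class confusion matrix. A minimal sketch (not the authors' code; the toy inputs are assumptions for illustration):

```python
# Sketch: compute mIoU and overall accuracy for a point cloud
# from integer class predictions and ground-truth labels.

def segmentation_metrics(preds, labels, num_classes):
    """Return (mIoU, overall_accuracy) for per-point class ids."""
    # Confusion matrix: rows = ground truth, columns = prediction.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(preds, labels):
        conf[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fp = sum(conf[r][c] for r in range(num_classes)) - tp
        fn = sum(conf[c]) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both preds and labels
            ious.append(tp / denom)

    correct = sum(conf[c][c] for c in range(num_classes))
    return sum(ious) / len(ious), correct / len(labels)

# Toy example: 2 classes, 4 points; one point of class 0 is mislabeled.
miou, acc = segmentation_metrics([0, 1, 1, 0], [0, 1, 0, 0], num_classes=2)
# miou ≈ 0.583 (IoU 2/3 for class 0, 1/2 for class 1), acc = 0.75
```

mIoU penalizes per-class errors equally, which is why it moves much more than overall accuracy on imbalanced outdoor scenes dominated by ground points.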
DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment
Li, Xiaofan, Wu, Chenming, Yang, Zhao, Xu, Zhihao, Liang, Dingkang, Zhang, Yumeng, Wan, Ji, Wang, Jun
This paper presents DriVerse, a generative model for simulating navigation-driven driving scenes from a single image and a future trajectory. Previous autonomous driving world models either feed the trajectory directly or feed discrete control signals into the generation pipeline, leading to poor alignment between the control inputs and the implicit features of the 2D base generative model, which results in low-fidelity video outputs. Some methods use coarse textual commands or discrete vehicle control signals, which lack the precision to guide fine-grained, trajectory-specific video generation, making them unsuitable for evaluating actual autonomous driving algorithms. DriVerse introduces explicit trajectory guidance in two complementary forms: it tokenizes trajectories into textual prompts using a predefined trend vocabulary for seamless language integration, and it converts 3D trajectories into 2D spatial motion priors to enhance control over static content within the driving scene. To better handle dynamic objects, we further introduce a lightweight motion alignment module, which focuses on the inter-frame consistency of dynamic pixels, significantly enhancing the temporal coherence of moving elements over long sequences. With minimal training and no need for additional data, DriVerse outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets. The code and models will be released to the public.
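The trajectory-tokenization idea can be sketched as follows. This is a hypothetical illustration only: the actual trend vocabulary, discretization thresholds, and prompt template in DriVerse are not reproduced here, and all names below are assumptions.

```python
# Hypothetical sketch of tokenizing a 2D trajectory into a textual prompt
# via a coarse "trend vocabulary" (thresholds and wording are invented).
import math

def heading_change_to_token(delta_heading_rad: float) -> str:
    """Map a segment-to-segment heading change to a trend word.
    Positive change = counterclockwise = left turn (x forward, y left)."""
    deg = math.degrees(delta_heading_rad)
    if deg > 30:
        return "sharp left"
    if deg > 5:
        return "left"
    if deg >= -5:
        return "straight"
    if deg >= -30:
        return "right"
    return "sharp right"

def trajectory_to_prompt(points):
    """Convert 2D waypoints into a textual prompt of trend tokens.
    Assumes consecutive heading changes stay well below 180 degrees,
    so no angle wrap-around handling is needed."""
    tokens = []
    prev_heading = None
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        heading = math.atan2(y1 - y0, x1 - x0)
        if prev_heading is not None:
            tokens.append(heading_change_to_token(heading - prev_heading))
        prev_heading = heading
    return "the vehicle goes " + ", then ".join(tokens)

prompt = trajectory_to_prompt([(0, 0), (1, 0), (2, 0), (3, 1)])
# → "the vehicle goes straight, then sharp left"
```

Discretizing the trajectory into words like these is what lets it be consumed by the language side of a text-conditioned video generator, while the 2D motion priors carry the precise geometry.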