occupancy prediction
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
- North America > United States > Texas (0.04)
- Europe > Netherlands > South Holland > Delft (0.04)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.92)
- Information Technology (1.00)
- Automobiles & Trucks (0.68)
- North America > United States > California (0.14)
- Asia > China > Shanghai > Shanghai (0.04)
Open Vocabulary 3D Occupancy Prediction from Images Supplementary Material
In this supplementary material, we first give additional details about the method in Sec. 1. Queries used for zero-shot semantic segmentation. We do this for all the annotated classes in the dataset (second column). One can see that, for example, class name'manmade' lacks descriptive specificity. In the text description of this class, we can find "... buildings, walls, guard rails, fences, poles, street signs, traffic lights ..." and more. Table 1: Queries used for zero-shot semantic segmentation.
- Transportation > Passenger (1.00)
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks (0.95)
- North America > United States > California (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- (2 more...)
RadarOcc: Robust 3D Occupancy Prediction with 4D Imaging Radar
Current methods predominantly rely on LiDAR or camera inputs for 3D occupancy prediction. These methods are susceptible to adverse weather conditions, limiting the all-weather deployment of self-driving cars. To improve perception robustness, we leverage the recent advances in automotive radars and introduce a novel approach that utilizes 4D imaging radar sensors for 3D occupancy prediction. Our method, RadarOcc, circumvents the limitations of sparse radar point clouds by directly processing the 4D radar tensor, thus preserving essential scene details. RadarOcc innovatively addresses the challenges associated with the voluminous and noisy 4D radar data by employing Doppler bins descriptors, sidelobe-aware spatial sparsification, and range-wise self-attention mechanisms. To minimize the interpolation errors associated with direct coordinate transformations, we also devise a spherical-based feature encoding followed by spherical-to-Cartesian feature aggregation. We benchmark various baseline methods based on distinct modalities on the public K-Radar dataset. The results demonstrate RadarOcc's state-of-the-art performance in radar-based 3D occupancy prediction and promising results even when compared with LiDAR-or camera-based methods. Additionally, we present qualitative evidence of the superior performance of 4D radar in adverse weather conditions and explore the impact of key pipeline components through ablation studies.
- Transportation > Ground > Road (0.39)
- Information Technology > Robotics & Automation (0.39)
- Automobiles & Trucks (0.39)
POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images
We describe an approach to predict open-vocabulary 3D semantic voxel occupancy map from input 2D images with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries. This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, where obtaining annotated training data in 3D is difficult. The contributions of this work are three-fold.
VG3T: Visual Geometry Grounded Gaussian Transformer
Generating a coherent 3D scene representation from multi-view images is a fundamental yet challenging task. Existing methods often struggle with multi-view fusion, leading to fragmented 3D representations and sub-optimal performance. To address this, we introduce VG3T, a novel multi-view feed-forward network that predicts a 3D semantic occupancy via a 3D Gaussian representation. Unlike prior methods that infer Gaussians from single-view images, our model directly predicts a set of semantically attributed Gaussians in a joint, multi-view fashion. This novel approach overcomes the fragmentation and inconsistency inherent in view-by-view processing, offering a unified paradigm to represent both geometry and semantics. We also introduce two key components, Grid-Based Sampling and Positional Refinement, to mitigate the distance-dependent density bias common in pixel-aligned Gaussian initialization methods. Our VG3T shows a notable 1.7%p improvement in mIoU while using 46% fewer primitives than the previous state-of-the-art on the nuScenes benchmark, highlighting its superior efficiency and performance.