Collaborating Authors

 Pu, Jian


Multi-modality Anomaly Segmentation on the Road

arXiv.org Artificial Intelligence

Semantic segmentation allows autonomous vehicles to understand their surroundings comprehensively. However, it is also crucial for the model to detect obstacles that may jeopardize the safety of autonomous driving systems. In our experiments, we find that current uni-modal anomaly segmentation frameworks tend to produce high anomaly scores for non-anomalous regions in images. Motivated by this empirical finding, we develop a multi-modal uncertainty-based anomaly segmentation framework, named MMRAS+, for autonomous driving systems. MMRAS+ effectively reduces the high anomaly outputs of non-anomalous classes by introducing the text modality through the CLIP text encoder. MMRAS+ is the first multi-modal anomaly segmentation solution for autonomous driving. Moreover, we develop an ensemble module to further boost the anomaly segmentation performance. Experiments on the RoadAnomaly, SMIYC, and Fishyscapes validation datasets demonstrate the superior performance of our method. The code is available at https://github.com/HengGao12/MMRAS_plus.
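To make "uncertainty-based anomaly score" concrete, the sketch below computes a common uni-modal baseline score, one minus the maximum softmax probability, for each pixel. It is not the MMRAS+ method and does not involve CLIP; the function names `softmax`, `anomaly_score`, and `anomaly_map` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def anomaly_score(logits):
    """Baseline uncertainty score: 1 - max softmax probability.

    Assumption: a pixel the segmenter is unsure about (flat softmax)
    is treated as more likely to be anomalous.
    """
    return 1.0 - max(softmax(logits))


def anomaly_map(logit_map):
    """Apply the score to an H x W grid of per-pixel logit lists."""
    return [[anomaly_score(px) for px in row] for row in logit_map]


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0]  # one class clearly dominates
    uncertain = [1.0, 1.0, 1.0]   # all classes equally likely
    print(anomaly_map([[confident, uncertain]]))
```

A score of this kind can be high for ordinary regions where the classifier is merely uncertain, which is the failure mode the abstract reports for uni-modal frameworks.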


A2DO: Adaptive Anti-Degradation Odometry with Deep Multi-Sensor Fusion for Autonomous Navigation

arXiv.org Artificial Intelligence

Central to this promise is the ability to achieve real-time, precise localization, which is crucial for navigation and collision avoidance. Odometry stands out as a pivotal technology that empowers vehicles to determine their position and construct a map of the environment in real time, without the need for pre-existing maps [1]. Despite its potential, traditional odometry systems often struggle to maintain localization accuracy under challenging conditions such as low-light scenarios, inclement weather, or obstructions. These scenarios underscore the pressing need for more robust SLAM solutions that can reliably operate under diverse real-world conditions. Multi-sensor fusion effectively addresses sensor degradation by combining data from complementary sensors, including cameras, LiDARs, and IMUs. Individual sensors may fail under specific conditions: LiDAR in rainy scenarios, cameras in low-light scenarios, and IMUs when drift accumulates. Previous geometric-based methods such as [2], [3] perform well in various scenarios. However, the reliance on rule-based approaches [4] for degraded sensor data makes these systems less effective in complex scenarios and requires significant manual calibration and tuning.


Detecting OOD Samples via Optimal Transport Scoring Function

arXiv.org Machine Learning

To deploy machine learning models in the real world, researchers have proposed many Out-of-Distribution (OOD) detection algorithms to help models identify unknown samples during the inference phase and prevent them from making untrustworthy predictions. Unlike methods that rely on extra data for outlier exposure training, post hoc methods detect OOD samples by developing scoring functions, which are model agnostic and do not require additional training. However, previous post hoc methods may fail to capture the geometric cues embedded in network representations. Thus, in this study, we propose a novel score function based on optimal transport theory, named OTOD, for OOD detection. We utilize information from features, logits, and the softmax probability space to calculate the OOD score for each test sample. Our experiments show that combining this information can boost the performance of OTOD by a certain margin. Experiments on the CIFAR-10 and CIFAR-100 benchmarks demonstrate the superior performance of our method. Notably, OTOD outperforms the state-of-the-art method GEN by 7.19% in the mean FPR@95 on the CIFAR-10 benchmark using ResNet-18 as the backbone, and by 12.51% in the mean FPR@95 using WideResNet-28 as the backbone. In addition, we provide theoretical guarantees for OTOD. The code is available at https://github.com/HengGao12/OTOD.
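The FPR@95 metric quoted above is commonly defined as the false positive rate on OOD samples at the threshold where 95% of in-distribution samples are accepted. A minimal sketch of that computation follows, assuming that a higher score means "more in-distribution"; the function name `fpr_at_tpr` is hypothetical and this is not the OTOD scoring function itself.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False positive rate on OOD data at a given true positive rate.

    Assumption: higher score = more in-distribution. The threshold is
    the lowest score that still accepts a `tpr` fraction of ID samples.
    """
    sorted_id = sorted(id_scores, reverse=True)
    # Number of ID samples that must be accepted; the small epsilon
    # guards against floating-point error in tpr * len(...).
    k = max(1, math.ceil(tpr * len(sorted_id) - 1e-9))
    threshold = sorted_id[k - 1]
    # An OOD sample at or above the threshold is wrongly accepted.
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


if __name__ == "__main__":
    id_scores = list(range(1, 21))  # 20 in-distribution scores: 1..20
    ood_scores = [0, 1, 2, 3]       # 4 OOD scores
    print(fpr_at_tpr(id_scores, ood_scores))
```

Lower is better: a detector with a lower FPR@95 lets fewer OOD samples through while still accepting 95% of in-distribution data.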


Deep Incomplete Multi-view Learning via Cyclic Permutation of VAEs

arXiv.org Artificial Intelligence

Multi-View Representation Learning (MVRL) aims to derive a unified representation from multi-view data by leveraging shared and complementary information across views. However, when views are irregularly missing, the incomplete data can lead to representations that lack sufficiency and consistency. To address this, we propose Multi-View Permutation of Variational Auto-Encoders (MVP), which excavates invariant relationships between views in incomplete data. MVP establishes inter-view correspondences in the latent space of Variational Auto-Encoders, enabling the inference of missing views and the aggregation of more sufficient information. To derive a valid Evidence Lower Bound (ELBO) for learning, we apply permutations to randomly reorder variables for cross-view generation and then partition them by views to maintain invariant meanings under permutations. Additionally, we enhance consistency by introducing an informational prior with cyclic permutations of posteriors, which turns the regularization term into a similarity measure across distributions. We demonstrate the effectiveness of our approach on seven diverse datasets with varying missing ratios, achieving superior performance in multi-view clustering and generation tasks.
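One way to read the cyclic-permutation prior is that each view's posterior is regularized toward the posterior of another view, chosen by a cyclic shift, so the KL term measures cross-view similarity. The sketch below illustrates that reading for diagonal Gaussian posteriors. It is an assumption-laden illustration, not the MVP objective: the names `kl_diag_gauss` and `cyclic_kl`, the single fixed shift, and the diagonal Gaussian form are all hypothetical choices.

```python
import math


def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(q || p) between two diagonal Gaussians."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


def cyclic_kl(posteriors, shift=1):
    """Sum of KL terms where view i is compared with view (i + shift) mod V.

    `posteriors` is a list of (mean, variance) pairs, one per view.
    Assumption: the shifted posterior plays the role of the prior.
    """
    n_views = len(posteriors)
    total = 0.0
    for i, (mu_q, var_q) in enumerate(posteriors):
        mu_p, var_p = posteriors[(i + shift) % n_views]
        total += kl_diag_gauss(mu_q, var_q, mu_p, var_p)
    return total


if __name__ == "__main__":
    views = [([0.0], [1.0]), ([1.0], [1.0])]  # two 1-D view posteriors
    print(cyclic_kl(views))
```

Under this reading the term is zero when all view posteriors coincide and grows as they disagree, which matches the abstract's description of the regularizer as a similarity measure across distributions.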


PAPL-SLAM: Principal Axis-Anchored Monocular Point-Line SLAM

arXiv.org Artificial Intelligence

In point-line SLAM systems, the utilization of line structural information and the optimization of lines are two significant problems. The former is usually addressed through structural regularities, while the latter typically involves using minimal parameter representations of lines in optimization. However, treating the two steps separately discards the constraints that each imposes on the other. We anchor lines with similar directions to a principal axis and optimize them with $n+2$ parameters for $n$ lines, solving both problems together. Our method considers scene structural information, which can be easily extended to different world hypotheses while significantly reducing the number of line parameters to be optimized, enabling rapid and accurate mapping and tracking. To further enhance the system's robustness and avoid mismatches, we model the line-axis probabilistic data association and provide algorithms for axis creation, updating, and optimization. Additionally, considering that most real-world scenes conform to the Atlanta World hypothesis, we provide a structural line detection strategy based on vertical priors and vanishing points. Experimental results and ablation studies on various indoor and outdoor datasets demonstrate the effectiveness of our system.


Rethinking Closed-loop Planning Framework for Imitation-based Model Integrating Prediction and Planning

arXiv.org Artificial Intelligence

In recent years, the integration of prediction and planning through neural networks has received substantial attention. Despite extensive studies on it, there is a noticeable gap in understanding the operation of such models within a closed-loop planning setting. To bridge this gap, we propose a novel closed-loop planning framework compatible with neural networks engaged in joint prediction and planning. The framework contains two running modes, namely planning and safety monitoring, wherein the neural network performs Motion Prediction and Planning (MPP) and Conditional Motion Prediction (CMP) correspondingly without altering architecture. We evaluate the efficacy of our framework using the nuPlan dataset and its simulator, conducting closed-loop experiments across diverse scenarios. The results demonstrate that the proposed framework ensures the feasibility and local stability of the planning process while maintaining safety with CMP safety monitoring. Compared to other learning-based methods, our approach achieves substantial improvement.


Towards Open-set Camera 3D Object Detection

arXiv.org Artificial Intelligence

Traditional camera 3D object detectors are typically trained to recognize a predefined set of known object classes. In real-world scenarios, these detectors may encounter unknown objects outside the training categories and fail to identify them correctly. To address this gap, we present OS-Det3D (Open-set Camera 3D Object Detection), a two-stage training framework enhancing the ability of camera 3D detectors to identify both known and unknown objects. The framework involves our proposed 3D Object Discovery Network (ODN3D), which is specifically trained using geometric cues such as the location and scale of 3D boxes to discover general 3D objects. ODN3D is trained in a class-agnostic manner, and the provided 3D object region proposals inherently come with data noise. To boost accuracy in identifying unknown objects, we introduce a Joint Objectness Selection (JOS) module. JOS selects the pseudo ground truth for unknown objects from the 3D object region proposals of ODN3D by combining the ODN3D objectness and camera feature attention objectness. Experiments on the nuScenes and KITTI datasets demonstrate the effectiveness of our framework in enabling camera 3D detectors to successfully identify unknown objects while also improving their performance on known objects.


OAML: Outlier Aware Metric Learning for OOD Detection Enhancement

arXiv.org Machine Learning

Out-of-distribution (OOD) detection methods have been developed to identify objects that a model has not seen during training. The Outlier Exposure (OE) methods use auxiliary datasets to train OOD detectors directly. However, the collection and learning of representative OOD samples may pose challenges. To tackle these issues, we propose the Outlier Aware Metric Learning (OAML) framework. The main idea of our method is to use the k-NN algorithm and Stable Diffusion model to generate outliers for training at the feature level without making any distributional assumptions. To increase feature discrepancies in the semantic space, we develop a mutual information-based contrastive learning approach for learning from OOD data effectively. Both theoretical and empirical results confirm the effectiveness of this contrastive learning technique. Furthermore, we incorporate knowledge distillation into our learning framework to prevent degradation of in-distribution classification accuracy. The combination of contrastive learning and knowledge distillation algorithms significantly enhances the performance of OOD detection. Experimental results across various datasets show that our method significantly outperforms previous OE methods.


EC-SLAM: Real-time Dense Neural RGB-D SLAM System with Effectively Constrained Global Bundle Adjustment

arXiv.org Artificial Intelligence

We introduce EC-SLAM, a real-time dense RGB-D simultaneous localization and mapping (SLAM) system utilizing Neural Radiance Fields (NeRF). Although recent NeRF-based SLAM systems have demonstrated encouraging outcomes, they have yet to completely leverage NeRF's capability to constrain pose optimization. By employing an effectively constrained global bundle adjustment (BA) strategy, our system makes use of NeRF's implicit loop closure correction capability. This improves the tracking accuracy by reinforcing the constraints on the keyframes that are most pertinent to the optimized current frame. In addition, by implementing a feature-based and uniform sampling strategy that minimizes the number of ineffective constraint points for pose optimization, we mitigate the effects of random sampling in NeRF. EC-SLAM utilizes sparse parametric encodings and the truncated signed distance field (TSDF) to represent the map in order to facilitate efficient fusion, resulting in fewer model parameters and faster convergence. A comprehensive evaluation conducted on the Replica, ScanNet, and TUM datasets showcases cutting-edge performance, including enhanced reconstruction accuracy resulting from precise pose estimation, 21 Hz run time, and tracking precision improvements of up to 50%. The source code is available at https://github.com/Lightingooo/EC-SLAM.


FastOcc: Accelerating 3D Occupancy Prediction by Fusing the 2D Bird's-Eye View and Perspective View

arXiv.org Artificial Intelligence

In autonomous driving, 3D occupancy prediction outputs voxel-wise status and semantic labels, giving a more comprehensive understanding of 3D scenes than traditional perception tasks such as 3D object detection and bird's-eye view (BEV) semantic segmentation. Researchers have recently explored various aspects of this task extensively, including view transformation techniques, ground-truth label generation, and elaborate network design, aiming to achieve superior performance. However, the inference speed, crucial for running on an autonomous vehicle, is neglected. To this end, a new method, dubbed FastOcc, is proposed. By carefully analyzing the network effect and latency of four parts, namely the input image resolution, image backbone, view transformation, and occupancy prediction head, it is found that the occupancy prediction head holds considerable potential for accelerating the model while keeping its accuracy. To improve this component, the time-consuming 3D convolution network is replaced with a novel residual-like architecture, where features are mainly digested by a lightweight 2D BEV convolution network and compensated by integrating the 3D voxel features interpolated from the original image features. Experiments on the Occ3D-nuScenes benchmark demonstrate that our FastOcc achieves state-of-the-art results with a fast inference speed.