Chen, Changhao
SLAM in the Dark: Self-Supervised Learning of Pose, Depth and Loop-Closure from Thermal Images
Xu, Yangfan, Hao, Qu, Zhang, Lilian, Mao, Jun, He, Xiaofeng, Wu, Wenqi, Chen, Changhao
Visual SLAM is essential for mobile robots, drone navigation, and VR/AR, but traditional RGB camera systems struggle in low-light conditions, driving interest in thermal SLAM, which excels in such environments. However, thermal imaging faces challenges like low contrast, high noise, and limited large-scale annotated datasets, restricting the use of deep learning in outdoor scenarios. We present DarkSLAM, a novel deep learning-based monocular thermal SLAM system designed for large-scale localization and reconstruction in complex lighting conditions. Our approach incorporates the Efficient Channel Attention (ECA) mechanism in visual odometry and the Selective Kernel Attention (SKA) mechanism in depth estimation to enhance pose accuracy and mitigate thermal depth degradation. Additionally, the system includes thermal depth-based loop closure detection and pose optimization, ensuring robust performance in low-texture thermal scenes. Extensive outdoor experiments demonstrate that DarkSLAM significantly outperforms existing methods like SC-Sfm-Learner and Shin et al., delivering precise localization and 3D dense mapping even in challenging nighttime environments.

Simultaneous Localization and Mapping (SLAM) is crucial for intelligent systems ranging from mobile robots and drones to self-driving vehicles, enabling real-time localization and mapping for autonomous navigation. Traditional visual SLAM systems, which rely on visible-light (RGB) cameras, struggle in challenging lighting conditions such as strong light, shadows, or nighttime, limiting their use in all-time scenarios. Thermal cameras, which detect heat radiation, offer a solution by functioning in darkness, smoke, and dust, complementing visible-light sensors. Recent improvements in thermal camera resolution and sensitivity have increased their reliability in autonomous systems.
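As a rough illustration of the Efficient Channel Attention (ECA) mechanism referenced above, the following is a minimal PyTorch sketch of a generic ECA block, not the authors' DarkSLAM implementation; the feature-map sizes are placeholders.

import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: re-weight channels using a 1-D conv
    over globally pooled channel descriptors."""
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):                      # x: (B, C, H, W)
        y = self.pool(x)                       # (B, C, 1, 1)
        y = y.squeeze(-1).transpose(1, 2)      # (B, 1, C)
        y = torch.sigmoid(self.conv(y))        # per-channel attention weights
        y = y.transpose(1, 2).unsqueeze(-1)    # (B, C, 1, 1)
        return x * y                           # re-weighted feature map

feat = torch.randn(2, 64, 32, 32)              # e.g. a thermal feature map
print(ECA()(feat).shape)                       # torch.Size([2, 64, 32, 32])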
ConcertoRL: An Innovative Time-Interleaved Reinforcement Learning Approach for Enhanced Control in Direct-Drive Tandem-Wing Vehicles
Zhang, Minghao, Song, Bifeng, Chen, Changhao, Lang, Xinyu
In control problems for insect-scale direct-drive experimental platforms under tandem-wing influence, the precision and safety of control during plug-and-play online training and control are paramount. The primary challenge facing existing reinforcement learning models is their limited safety during exploration and the stability of the continuous training process. To address these challenges, we introduce the ConcertoRL algorithm to enhance control precision and stabilize the online training process. It consists of two main innovations: a time-interleaved mechanism that interweaves classical controllers with reinforcement learning-based controllers to improve control precision in the initial stages, and a policy composer that organizes the experience gained from previous learning to ensure the stability of the online training process. This paper conducts a series of experiments. First, experiments incorporating the time-interleaved mechanism demonstrate a substantial performance boost of approximately 70% over scenarios without reinforcement learning enhancements and a 50% increase in efficiency compared to reference controllers with doubled control frequencies. These results highlight the algorithm's ability to create a synergistic effect that exceeds the sum of its parts. Second, ablation studies on the policy composer reveal that this module significantly enhances the stability of ConcertoRL during online training. Lastly, experiments on the universality of the ConcertoRL framework demonstrate its compatibility with various classical controllers, consistently achieving excellent control outcomes. ConcertoRL sets a new benchmark in control effectiveness for the challenges posed by direct-drive platforms under tandem-wing influence and establishes a comprehensive framework for integrating classical and reinforcement learning-based control methodologies.
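To make the time-interleaved mechanism concrete, here is a hypothetical toy sketch in which even control ticks are handled by a classical PD controller and odd ticks by the learned policy; the ToyPlant, the gains, and the placeholder rl_policy are all assumptions, not the paper's platform or algorithm.

import random

class ToyPlant:
    """Toy 1-D plant standing in for the tandem-wing platform (hypothetical)."""
    def __init__(self):
        self.x, self.v = 1.0, 0.0
    def reset(self):
        self.x, self.v = 1.0, 0.0
        return (self.x, self.v)
    def step(self, u, dt=0.01):
        self.v += u * dt
        self.x += self.v * dt
        reward = -abs(self.x)                       # drive the state to zero
        return (self.x, self.v), reward, abs(self.x) > 10.0

def pd_controller(obs, kp=20.0, kd=5.0):            # classical baseline
    x, v = obs
    return -kp * x - kd * v

def rl_policy(obs):                                 # placeholder learned policy
    return pd_controller(obs) + random.uniform(-0.5, 0.5)

def run_episode(env, steps=500):
    obs = env.reset()
    total = 0.0
    for t in range(steps):
        # time-interleaving: even ticks -> classical, odd ticks -> RL policy
        u = pd_controller(obs) if t % 2 == 0 else rl_policy(obs)
        obs, r, done = env.step(u)
        total += r
        if done:
            break
    return total

print(run_episode(ToyPlant()))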
DK-SLAM: Monocular Visual SLAM with Deep Keypoints Adaptive Learning, Tracking and Loop-Closing
Qu, Hao, Zhang, Lilian, Mao, Jun, Tie, Junbo, He, Xiaofeng, Hu, Xiaoping, Shi, Yifei, Chen, Changhao
Unreliable feature extraction and matching with handcrafted features undermine the performance of visual SLAM in complex real-world scenarios. While learned local features, leveraging CNNs, excel at capturing high-level information and perform well on matching benchmarks, they struggle in continuous motion scenes, resulting in poor generalization and reduced loop-detection accuracy. To address these issues, we present DK-SLAM, a monocular visual SLAM system with adaptive deep local features trained via Model-Agnostic Meta-Learning (MAML), together with a coarse-to-fine feature tracking approach: a direct method first approximates the relative pose between consecutive frames, and a feature matching method then refines the pose estimate. To counter cumulative positioning errors, a novel online-learned binary feature-based loop closure module identifies loop nodes within a sequence. Experimental results underscore DK-SLAM's efficacy: it outperforms representative SLAM solutions, such as ORB-SLAM3, on publicly available datasets.
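Since the abstract leans on MAML, here is a minimal sketch of the generic MAML inner/outer loop on a toy regression problem; it illustrates the meta-learning recipe only and is not DK-SLAM's keypoint training code.

import torch

torch.manual_seed(0)
w = torch.zeros(1, requires_grad=True)              # meta-parameters
meta_opt = torch.optim.SGD([w], lr=1e-2)
inner_lr = 0.1

def task_batch(slope):
    x = torch.randn(16, 1)
    return x, slope * x                              # toy task: y = slope * x

for step in range(200):
    meta_opt.zero_grad()
    for slope in (1.0, 2.0, 3.0):                    # a few "tasks"
        x_s, y_s = task_batch(slope)                 # support set
        x_q, y_q = task_batch(slope)                 # query set
        inner_loss = ((x_s * w - y_s) ** 2).mean()
        (g,) = torch.autograd.grad(inner_loss, w, create_graph=True)
        w_adapted = w - inner_lr * g                 # one inner-loop step
        outer_loss = ((x_q * w_adapted - y_q) ** 2).mean()
        outer_loss.backward()                        # accumulate meta-gradient
    meta_opt.step()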
ReLoc-PDR: Visual Relocalization Enhanced Pedestrian Dead Reckoning via Graph Optimization
Chen, Zongyang, Pan, Xianfei, Chen, Changhao
Accurately and reliably positioning pedestrians in satellite-denied conditions remains a significant challenge. To address this, ReLoc-PDR fuses pedestrian dead reckoning with visual relocalization via graph optimization. A graph optimization-based fusion mechanism with the Tukey kernel effectively corrects cumulative errors and mitigates the impact of abnormal visual observations. Real-world experiments demonstrate that ReLoc-PDR surpasses representative methods in accuracy and robustness, achieving accurate and robust pedestrian positioning using only a smartphone in challenging environments such as less-textured corridors and dark nighttime scenarios.

Robust and accurate indoor pedestrian navigation plays a crucial role in enabling various location-based services (LBS). The pursuit of a low-cost, robust, and self-contained positioning system is of great importance to flexible and resilient operation in such environments, and it is a fundamental requirement for numerous applications, including emergency rescue operations, path guidance systems, and augmented reality experiences [1]. Existing indoor positioning solutions relying on the deployment of dedicated infrastructure are susceptible to signal interference and non-line-of-sight (NLOS) conditions, and their widespread deployment can be prohibitively expensive. In contrast to infrastructure-based positioning methods, visual relocalization, which estimates the 6 degree-of-freedom (DoF) pose of a query image against an existing 3D map model, holds potential for achieving drift-free global localization using only a camera. Since smartphones commonly come with built-in cameras, utilizing image-based localization methods on mobile devices becomes a viable approach. However, the challenge lies in the susceptibility of image-based relocalization to environmental factors such as changes in lighting conditions and scene dynamics.
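As a small illustration of the Tukey kernel used to down-weight abnormal visual observations in robust graph optimization, the following is a generic Tukey biweight function, not the paper's optimizer; the cutoff value is the conventional default for standardized residuals.

import numpy as np

def tukey_weight(residual, c=4.685):
    """Tukey biweight: residuals beyond the cutoff c receive zero weight,
    so outlying observations stop influencing the least-squares solution."""
    r = np.abs(residual) / c
    w = (1.0 - r ** 2) ** 2
    w[r > 1.0] = 0.0
    return w

residuals = np.array([0.1, 0.5, 2.0, 10.0])     # e.g. reprojection errors
print(tukey_weight(residuals))                  # large residuals -> zero weight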
SelfOdom: Self-supervised Egomotion and Depth Learning via Bi-directional Coarse-to-Fine Scale Recovery
Qu, Hao, Zhang, Lilian, Hu, Xiaoping, He, Xiaofeng, Pan, Xianfei, Chen, Changhao
Accurately perceiving location and scene is crucial for autonomous driving and mobile robots. Recent advances in deep learning have made it possible to learn egomotion and depth from monocular images in a self-supervised manner, without requiring highly precise labels to train the networks. However, monocular vision methods suffer from a limitation known as scale ambiguity, which restricts their application when absolute scale is necessary. To address this, we propose SelfOdom, a self-supervised dual-network framework that can robustly and consistently learn and generate pose and depth estimates at global scale from monocular images. In particular, we introduce a novel coarse-to-fine training strategy that enables the metric scale to be recovered in a two-stage process. Furthermore, SelfOdom is flexible and can incorporate inertial data with images through an attention-based fusion module, which improves its robustness in challenging scenarios. Our model excels in both normal and challenging lighting conditions, including difficult night scenes. Extensive experiments on public datasets demonstrate that SelfOdom outperforms representative traditional and learning-based VO and VIO models.
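As a generic illustration of recovering a metric scale for up-to-scale monocular depth (not SelfOdom's actual two-stage coarse-to-fine procedure), one can align predictions to a few absolute-depth anchors; the anchors here are synthetic stand-ins.

import numpy as np

def recover_scale(pred_depth, anchor_depth, anchor_mask):
    # robust single scale factor aligning up-to-scale depth to metric anchors
    ratios = anchor_depth[anchor_mask] / pred_depth[anchor_mask]
    return np.median(ratios)

pred = np.random.uniform(0.5, 2.0, size=(4, 4))   # up-to-scale prediction
true = 5.0 * pred                                 # pretend metric depth
mask = np.zeros_like(pred, dtype=bool)
mask[::2, ::2] = True                             # sparse anchor locations
scale = recover_scale(pred, true, mask)
print(scale, np.allclose(scale * pred, true))     # ~5.0, True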
Drone-NeRF: Efficient NeRF Based 3D Scene Reconstruction for Large-Scale Drone Survey
Jia, Zhihao, Wang, Bing, Chen, Changhao
Neural rendering has garnered substantial attention owing to its capacity for creating realistic 3D scenes. However, applying it to extensive scenes remains challenging and limits its effectiveness. In this work, we propose the Drone-NeRF framework to enable efficient reconstruction of unbounded large-scale scenes suited for drone oblique photography using Neural Radiance Fields (NeRF). Our approach divides the scene into uniform sub-blocks based on camera position and depth visibility. Sub-scenes are trained in parallel using NeRF and then merged into a complete scene. We refine the model by optimizing camera poses and guiding NeRF with a uniform sampler; integrating the chosen samples enhances accuracy. A hash-coded fusion MLP accelerates the density representation, yielding RGB and depth outputs. Our framework accounts for sub-scene constraints, reduces parallel-training noise, handles shadow occlusion, and merges sub-regions for a polished rendering result. Drone-NeRF demonstrates promising capabilities in addressing challenges related to scene complexity, rendering efficiency, and accuracy in drone-obtained imagery.
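A simplified sketch of dividing a drone survey into uniform sub-blocks by camera position, so each block could be trained as its own NeRF; the block size is arbitrary and the paper's depth-visibility criterion is omitted.

import numpy as np

def assign_blocks(cam_positions, block_size=50.0):
    """Group camera indices into a uniform x-y grid of sub-blocks."""
    xy = cam_positions[:, :2]
    origin = xy.min(axis=0)
    idx = np.floor((xy - origin) / block_size).astype(int)   # (N, 2) grid ids
    blocks = {}
    for cam_id, key in enumerate(map(tuple, idx)):
        blocks.setdefault(key, []).append(cam_id)
    return blocks

cams = np.random.uniform(0, 200, size=(100, 3))   # x, y, z in metres
for block, members in sorted(assign_blocks(cams).items()):
    print(block, len(members))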
Deep Learning for Inertial Positioning: A Survey
Chen, Changhao, Pan, Xianfei
Inertial sensors are widely utilized in smartphones, drones, robots, and IoT devices, playing a crucial role in enabling ubiquitous and reliable localization. Inertial sensor-based positioning is essential in various applications, including personal navigation, location-based security, and human-device interaction. However, the measurements of low-cost MEMS inertial sensors are inevitably corrupted by various error sources, and doubly integrating them in traditional inertial navigation algorithms leads to unbounded error drift. In recent years, with the rapid increase in sensor data and computational power, deep learning techniques have developed quickly and have sparked significant research into the inertial positioning problem. Relevant literature in this field spans mobile computing, robotics, and machine learning. In this article, we provide a comprehensive review of deep learning-based inertial positioning and its applications in tracking pedestrians, drones, vehicles, and robots. We connect efforts from different fields and discuss how deep learning can be applied to address issues such as sensor calibration, positioning error drift reduction, and multi-sensor fusion. This article aims to attract readers from various backgrounds, including researchers and practitioners interested in the potential of deep learning-based techniques to solve inertial positioning problems. Our review demonstrates the exciting possibilities that deep learning brings to the table and provides a roadmap for future research in this field.
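The unbounded drift from double integration is easy to reproduce: a constant accelerometer bias b integrated twice over time t yields a position error of roughly 0.5 * b * t^2, as in this toy example (the bias and noise values are illustrative, not measured sensor specs).

import numpy as np

rng = np.random.default_rng(0)
dt, t_end = 0.01, 60.0
n = int(t_end / dt)
bias, noise_std = 0.05, 0.02                  # m/s^2, illustrative values
true_acc = np.zeros(n)                        # the device is actually at rest
meas_acc = true_acc + bias + rng.normal(0.0, noise_std, n)

vel = np.cumsum(meas_acc) * dt                # first integration
pos = np.cumsum(vel) * dt                     # second integration
print(f"position drift after {t_end:.0f}s: {pos[-1]:.1f} m")   # about 90 m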
Autonomous Learning for Face Recognition in the Wild via Ambient Wireless Cues
Lu, Chris Xiaoxuan, Kan, Xuan, Du, Bowen, Chen, Changhao, Wen, Hongkai, Markham, Andrew, Trigoni, Niki, Stankovic, John
Facial recognition is a key enabling component for emerging Internet of Things (IoT) services such as smart homes or responsive offices. Through the use of deep neural networks, facial recognition has achieved excellent performance. However, this is only possible when trained with hundreds of images of each user in different viewing and lighting conditions. Clearly, this level of effort in enrolment and labelling is impossible for widespread deployment and adoption. Inspired by the fact that most people carry smart wireless devices with them, e.g. smartphones, we propose to use this wireless identifier as a supervisory label. This allows us to curate a dataset of facial images that are unique to a certain domain, e.g. a set of people in a particular office. This custom corpus can then be used to fine-tune existing pre-trained models, e.g. FaceNet. However, due to the vagaries of wireless propagation in buildings, the supervisory labels are noisy and weak. We propose a novel technique, AutoTune, which learns and refines the association between a face and a wireless identifier over time by increasing the inter-cluster separation and minimizing the intra-cluster distance. Through extensive experiments with multiple users on two sites, we demonstrate the ability of AutoTune to design an environment-specific, continually evolving facial recognition system with entirely no user effort.
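A toy sketch of the clustering quantities mentioned above, intra-cluster distance and inter-cluster separation, computed over synthetic embeddings; this illustrates the objective only and is not the paper's refinement algorithm.

import numpy as np

def cluster_stats(embeddings, labels):
    """Mean distance to the assigned centroid (intra) and mean pairwise
    distance between centroids (inter)."""
    centroids = {l: embeddings[labels == l].mean(axis=0) for l in set(labels)}
    intra = np.mean([np.linalg.norm(e - centroids[l])
                     for e, l in zip(embeddings, labels)])
    cs = list(centroids.values())
    inter = np.mean([np.linalg.norm(a - b)
                     for i, a in enumerate(cs) for b in cs[i + 1:]])
    return intra, inter

emb = np.vstack([np.random.randn(20, 128) + 5 * i for i in range(3)])
lab = np.repeat(np.arange(3), 20)
print(cluster_stats(emb, lab))    # tight clusters, well-separated centroids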
DynaNet: Neural Kalman Dynamical Model for Motion Estimation and Prediction
Chen, Changhao, Lu, Chris Xiaoxuan, Wang, Bing, Trigoni, Niki, Markham, Andrew
Dynamical models estimate and predict the temporal evolution of physical systems. State Space Models (SSMs) in particular represent the system dynamics with many desirable properties, such as being able to model uncertainty in both the model and measurements, and optimal (in the Bayesian sense) recursive formulations, e.g. the Kalman Filter. However, they require significant domain knowledge to derive the parametric form and considerable hand-tuning to correctly set all the parameters. Data-driven techniques, e.g. Recurrent Neural Networks, have emerged as compelling alternatives to SSMs with wide success across a number of challenging tasks, in part due to their ability to extract relevant features from rich inputs. Yet they lack interpretability and robustness to unseen conditions. In this work, we present DynaNet, a hybrid deep learning and time-varying state-space model which can be trained end-to-end. Our neural Kalman dynamical model allows us to exploit the relative merits of each approach. We demonstrate state-of-the-art estimation and prediction on a number of physically challenging tasks, including visual odometry, sensor fusion for visual-inertial navigation, and pendulum control. In addition, we show how DynaNet can indicate failures through investigation of properties such as the rate of innovation (Kalman gain).
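For reference, this is the textbook linear Kalman filter predict/update step that DynaNet builds on, including the Kalman gain mentioned above; the constant-velocity model and noise values are placeholders, not the learned model from the paper.

import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    # predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # update
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)     # correct with the innovation
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new, K

# constant-velocity model with a position-only measurement
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q, R = 1e-3 * np.eye(2), np.array([[1e-2]])
x, P = np.zeros(2), np.eye(2)
x, P, K = kalman_step(x, P, np.array([1.0]), F, H, Q, R)
print(x, K.ravel())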
Selective Sensor Fusion for Neural Visual-Inertial Odometry
Chen, Changhao, Rosa, Stefano, Miao, Yishu, Lu, Chris Xiaoxuan, Wu, Wei, Markham, Andrew, Trigoni, Niki
Deep learning approaches for Visual-Inertial Odometry (VIO) have proven successful, but they rarely focus on incorporating robust fusion strategies for dealing with imperfect input sensory data. We propose a novel end-to-end selective sensor fusion framework for monocular VIO, which fuses monocular images and inertial measurements in order to estimate the trajectory while improving robustness to real-life issues such as missing and corrupted data or poor sensor synchronization. In particular, we propose two fusion modalities based on different masking strategies, deterministic soft fusion and stochastic hard fusion, and we compare them with previously proposed direct fusion baselines. During testing, the network is able to selectively process the features of the available sensor modalities and produce a trajectory at scale. We present a thorough investigation of performance on three public datasets covering autonomous driving, Micro Aerial Vehicle (MAV) flight, and hand-held VIO. The results demonstrate the effectiveness of the fusion strategies, which offer better performance than direct fusion, particularly in the presence of corrupted data. In addition, we study the interpretability of the fusion networks by visualising the masking layers in different scenarios and with varying data corruption, revealing interesting correlations between the fusion networks and imperfect sensory input data.
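A minimal PyTorch sketch of the two masking ideas, a deterministic sigmoid (soft) mask and a stochastic Gumbel-softmax (hard) mask over concatenated visual and inertial features; the feature dimensions and the single linear mask network are assumptions, not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveFusion(nn.Module):
    def __init__(self, vis_dim=256, imu_dim=64, hard=False):
        super().__init__()
        d = vis_dim + imu_dim
        self.hard = hard
        self.mask_net = nn.Linear(d, d if not hard else 2 * d)

    def forward(self, vis_feat, imu_feat):
        f = torch.cat([vis_feat, imu_feat], dim=-1)
        if not self.hard:                              # deterministic soft mask
            mask = torch.sigmoid(self.mask_net(f))
        else:                                          # stochastic hard mask
            logits = self.mask_net(f).view(f.shape[0], -1, 2)
            mask = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 0]
        return f * mask                                # masked fused feature

fusion = SelectiveFusion(hard=True)
out = fusion(torch.randn(8, 256), torch.randn(8, 64))
print(out.shape)                                       # torch.Size([8, 320])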