We thank all reviewers for their efforts and positive comments

Neural Information Processing Systems

We thank all reviewers for their efforts and positive comments. To Reviewer #1: Thank you for the comments; we will follow the suggestions and conduct more experiments and analyses. To Reviewer #2: Thank you very much for confirming our contributions. Q1) [Strong priors]: "geometry-consistency" is not a manually designed prior for specific scenes. Our contribution lies in leveraging SfM theory more deeply in unsupervised depth learning.


SLAM in the Dark: Self-Supervised Learning of Pose, Depth and Loop-Closure from Thermal Images

Xu, Yangfan, Hao, Qu, Zhang, Lilian, Mao, Jun, He, Xiaofeng, Wu, Wenqi, Chen, Changhao

arXiv.org Artificial Intelligence

Visual SLAM is essential for mobile robots, drone navigation, and VR/AR, but traditional RGB camera systems struggle in low-light conditions, driving interest in thermal SLAM, which excels in such environments. However, thermal imaging faces challenges such as low contrast, high noise, and limited large-scale annotated datasets, restricting the use of deep learning in outdoor scenarios. We present DarkSLAM, a novel deep learning-based monocular thermal SLAM system designed for large-scale localization and reconstruction in complex lighting conditions. Our approach incorporates the Efficient Channel Attention (ECA) mechanism in visual odometry and the Selective Kernel Attention (SKA) mechanism in depth estimation to enhance pose accuracy and mitigate thermal depth degradation. Additionally, the system includes thermal depth-based loop closure detection and pose optimization, ensuring robust performance in low-texture thermal scenes. Extensive outdoor experiments demonstrate that DarkSLAM significantly outperforms existing methods such as SC-SfMLearner and Shin et al., delivering precise localization and 3D dense mapping even in challenging nighttime environments.

I. INTRODUCTION. Simultaneous Localization and Mapping (SLAM) is crucial for intelligent systems, from mobile robots and drones to self-driving vehicles, enabling real-time localization and mapping for autonomous navigation. Traditional visual SLAM systems, which rely on visible-light (RGB) cameras, struggle in challenging lighting conditions such as strong light, shadows, or nighttime, limiting their use in all-time scenarios. Thermal cameras, which detect heat radiation, offer a solution by functioning in darkness, smoke, and dust, complementing visible-light sensors. Recent improvements in thermal camera resolution and sensitivity have increased their reliability in autonomous systems.
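DarkSLAM's code is not reproduced here, but the ECA block it adopts is a published, lightweight channel-attention mechanism. Below is a minimal PyTorch sketch of a generic ECA module as it is typically inserted into an encoder; the adaptive kernel-size rule follows the ECA-Net paper, while all names and dimensions are illustrative rather than DarkSLAM's exact configuration.

import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention (Wang et al., CVPR 2020): global average
    pooling followed by a 1-D convolution across channels; adds only ~k
    parameters per block."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Adaptive kernel size from the ECA paper: odd, grows with log2(C).
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 else k + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map, e.g. from an odometry encoder stage.
        w = x.mean(dim=(2, 3))                    # (B, C): global average pool
        w = self.conv(w.unsqueeze(1)).squeeze(1)  # 1-D conv across channel axis
        w = torch.sigmoid(w)[:, :, None, None]    # (B, C, 1, 1) channel weights
        return x * w                              # re-weight channels

# Usage sketch: feats = ECA(channels=feats.shape[1])(feats)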


Leveraging Scene Geometry and Depth Information for Robust Image Deraining

Xu, Ningning, Yang, Jidong J.

arXiv.org Artificial Intelligence

Image deraining holds great potential for enhancing the vision of autonomous vehicles in rainy conditions, contributing to safer driving. Previous works have primarily focused on employing a single network architecture to generate derained images. However, they often fail to fully exploit the rich prior knowledge embedded in the scenes. In particular, most methods overlook depth information, which can provide valuable context about scene geometry and guide more robust deraining. In this work, we introduce a novel learning framework that integrates multiple networks: an AutoEncoder for deraining, an auxiliary network to incorporate depth information, and two supervision networks to enforce feature consistency between rainy and clear scenes. This multi-network design enables our model to effectively capture the underlying scene structure, producing clearer and more accurately derained images and improving object detection for autonomous vehicles. Extensive experiments on three widely used datasets demonstrate the effectiveness of our proposed method.
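The paper's objective is not given here; as a rough illustration of how such a multi-network design can be trained jointly, the hypothetical PyTorch loss below combines the AutoEncoder's reconstruction term, the auxiliary depth term, and the supervision networks' feature-consistency term. All function names and the weights lambda_d and lambda_f are assumptions, not the paper's actual formulation.

import torch.nn.functional as F

def total_loss(derained, clean, pred_depth, depth_prior,
               feats_rainy, feats_clean, lambda_d=0.1, lambda_f=0.05):
    """Hypothetical composite objective for a multi-network deraining
    framework: reconstruction + depth guidance + feature consistency.
    Loss weights are illustrative placeholders."""
    l_rec = F.l1_loss(derained, clean)            # AutoEncoder output vs. clear image
    l_depth = F.l1_loss(pred_depth, depth_prior)  # auxiliary depth branch
    # Supervision networks: features of the derained image should match
    # features extracted from the corresponding clear scene.
    l_feat = sum(F.mse_loss(fr, fc) for fr, fc in zip(feats_rainy, feats_clean))
    return l_rec + lambda_d * l_depth + lambda_f * l_feat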


DCPI-Depth: Explicitly Infusing Dense Correspondence Prior to Unsupervised Monocular Depth Estimation

Zhang, Mengtan, Feng, Yi, Chen, Qijun, Fan, Rui

arXiv.org Artificial Intelligence

There has been a recent surge of interest in learning to perceive depth from monocular videos in an unsupervised fashion. A key challenge in this field is achieving robust and accurate depth estimation in challenging scenarios, particularly in regions with weak textures or where dynamic objects are present. This study makes three major contributions by delving deeply into dense correspondence priors to provide existing frameworks with explicit geometric constraints. The first novelty is a contextual-geometric depth consistency loss, which employs depth maps triangulated from dense correspondences based on estimated ego-motion to guide the learning of depth perception from contextual information, since explicitly triangulated depth maps capture accurate relative distances among pixels. The second novelty arises from the observation that there exists an explicit, deducible relationship between optical flow divergence and depth gradient. A differential property correlation loss is therefore designed to refine depth estimation with a specific emphasis on local variations. The third novelty is a bidirectional stream co-adjustment strategy that enhances the interaction between rigid and optical flows, encouraging the former towards more accurate correspondence and making the latter more adaptable across various scenarios under the static-scene hypothesis. DCPI-Depth, a framework that incorporates all these innovative components and couples two bidirectional and collaborative streams, achieves state-of-the-art performance and generalizability across multiple public datasets, outperforming all existing prior art. Specifically, it demonstrates accurate depth estimation in texture-less and dynamic regions, and yields more reasonable smoothness.
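The exact form of the differential property correlation loss is defined in the paper; the sketch below only illustrates the underlying idea under stated assumptions: compute the optical-flow divergence and the inverse-depth gradient with finite differences and penalize their decorrelation. The shapes, the use of inverse depth, and the Pearson-style normalization are illustrative choices, not DCPI-Depth's implementation.

import torch

def divergence(flow):
    # flow: (B, 2, H, W); forward differences, cropped to a common interior.
    du_dx = flow[:, 0, :, 1:] - flow[:, 0, :, :-1]  # (B, H, W-1)
    dv_dy = flow[:, 1, 1:, :] - flow[:, 1, :-1, :]  # (B, H-1, W)
    return du_dx[:, :-1, :] + dv_dy[:, :, :-1]      # (B, H-1, W-1)

def correlation_loss(flow, depth, eps=1e-6):
    """Illustrative 'differential property correlation' penalty: flow
    divergence and local depth variation should co-vary in static regions.
    depth: (B, H, W) predicted depth map."""
    div = divergence(flow).abs()
    # Spatial gradient magnitude of inverse depth, cropped to match div.
    inv = 1.0 / depth.clamp(min=eps)
    gx = inv[:, :, 1:] - inv[:, :, :-1]
    gy = inv[:, 1:, :] - inv[:, :-1, :]
    grad = gx[:, :-1, :].abs() + gy[:, :, :-1].abs()
    # Negative Pearson correlation per image, averaged over the batch.
    d = div.flatten(1); g = grad.flatten(1)
    d = d - d.mean(dim=1, keepdim=True)
    g = g - g.mean(dim=1, keepdim=True)
    corr = (d * g).sum(1) / (d.norm(dim=1) * g.norm(dim=1) + eps)
    return (1.0 - corr).mean()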


Real-time Holistic Robot Pose Estimation with Unknown States

Ban, Shikun, Fan, Juling, Zhu, Wentao, Ma, Xiaoxuan, Qiao, Yu, Wang, Yizhou

arXiv.org Artificial Intelligence

Estimating robot pose from RGB images is a crucial problem in computer vision and robotics. While previous methods have achieved promising performance, most of them presume full knowledge of robot internal states, e.g., ground-truth robot joint angles, which are not always available in real-world scenarios. On the other hand, existing approaches that estimate robot pose without joint state priors suffer from heavy computation burdens and thus cannot support real-time applications. This work addresses the urgent need for efficient robot pose estimation with unknown states. We propose an end-to-end pipeline for real-time, holistic robot pose estimation from a single RGB image, even in the absence of known robot states. Our method decomposes the problem into estimating camera-to-robot rotation, robot state parameters, keypoint locations, and root depth, and we design a corresponding neural network module for each task. This approach allows for learning multi-facet representations and facilitates sim-to-real transfer through self-supervised learning. Notably, our method performs inference in a single feed-forward pass, eliminating the need for costly test-time iterative optimization. As a result, it delivers a 12-fold speedup with state-of-the-art accuracy, enabling real-time holistic robot pose estimation for the first time. Code is available at https://oliverbansk.github.io/Holistic-Robot-Pose/.
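As a rough picture of the decomposition described above, the hypothetical PyTorch skeleton below routes one shared image encoding through four task heads (camera-to-robot rotation, robot state parameters, keypoint locations, root depth) in a single feed-forward pass. The backbone, head designs, and all dimensions are placeholders, not the authors' architecture.

import torch.nn as nn

class HolisticRobotPoseNet(nn.Module):
    """Hypothetical skeleton of the four-way decomposition: one shared
    backbone, four task-specific heads, single feed-forward inference."""
    def __init__(self, n_joints=7, n_keypoints=14, feat_dim=512):
        super().__init__()
        self.n_keypoints = n_keypoints
        self.backbone = nn.Sequential(  # stand-in for a real CNN encoder
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.rot_head = nn.Linear(feat_dim, 6)               # 6-D rotation representation
        self.state_head = nn.Linear(feat_dim, n_joints)      # joint angles (unknown states)
        self.kpt_head = nn.Linear(feat_dim, n_keypoints * 2) # 2-D keypoint locations
        self.depth_head = nn.Linear(feat_dim, 1)             # root depth

    def forward(self, img):
        f = self.backbone(img)
        kpts = self.kpt_head(f).view(-1, self.n_keypoints, 2)
        return self.rot_head(f), self.state_head(f), kpts, self.depth_head(f)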


SelfOdom: Self-supervised Egomotion and Depth Learning via Bi-directional Coarse-to-Fine Scale Recovery

Qu, Hao, Zhang, Lilian, Hu, Xiaoping, He, Xiaofeng, Pan, Xianfei, Chen, Changhao

arXiv.org Artificial Intelligence

Accurately perceiving location and scene is crucial for autonomous driving and mobile robots. Recent advances in deep learning have made it possible to learn egomotion and depth from monocular images in a self-supervised manner, without requiring highly precise labels to train the networks. However, monocular vision methods suffer from a limitation known as scale ambiguity, which restricts their application when absolute scale is necessary. To address this, we propose SelfOdom, a self-supervised dual-network framework that can robustly and consistently learn and generate pose and depth estimates at global scale from monocular images. In particular, we introduce a novel coarse-to-fine training strategy that enables the metric scale to be recovered in a two-stage process. Furthermore, SelfOdom is flexible and can incorporate inertial data alongside images via an attention-based fusion module, which improves its robustness in challenging scenarios. Our model excels in both normal and challenging lighting conditions, including difficult night scenes. Extensive experiments on public datasets have demonstrated that SelfOdom outperforms representative traditional and learning-based VO and VIO models.
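The attention-based fusion module is described only at a high level here; the minimal sketch below shows one common way such visual-inertial fusion is realized, with a learned per-channel gate over the concatenated features so the network can down-weight an unreliable modality (e.g., images at night). The design and dimensions are assumptions, not SelfOdom's exact module.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative soft-attention fusion of visual and inertial features,
    in the spirit of SelfOdom's fusion module (exact design differs)."""
    def __init__(self, v_dim=512, i_dim=256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(v_dim + i_dim, v_dim + i_dim),
            nn.Sigmoid(),  # per-channel weights in [0, 1]
        )

    def forward(self, v_feat, i_feat):
        # v_feat: (B, v_dim) visual features; i_feat: (B, i_dim) inertial features.
        x = torch.cat([v_feat, i_feat], dim=-1)
        return self.gate(x) * x  # gated fused representation for the pose head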


Unsupervised Depth Estimation, 3D Face Rotation and Replacement

Moniz, Joel Ruben Antony, Beckham, Christopher, Rajotte, Simon, Honari, Sina, Pal, Chris

Neural Information Processing Systems

We present an unsupervised approach for learning to estimate three-dimensional (3D) facial structure from a single image while also predicting 3D viewpoint transformations that match a desired pose and facial geometry. We achieve this by inferring the depth of facial keypoints of an input image in an unsupervised manner, without using any form of ground-truth depth information. We show how it is possible to use these depths as intermediate computations within a new backpropable loss to predict the parameters of a 3D affine transformation matrix that maps inferred 3D keypoints of an input face to the corresponding 2D keypoints on a desired target facial geometry or pose. Our resulting approach, called DepthNets, can therefore be used to infer plausible 3D transformations from one face pose to another, allowing faces to be frontalized, transformed into 3D models, or even warped to another pose and facial geometry. Lastly, we identify certain shortcomings with our formulation, and explore adversarial image translation techniques as a post-processing step to re-synthesize complete head shots for faces re-targeted to different poses or identities.
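The closed-form step that makes such a loss backpropable can be sketched as follows: lift the source 2-D keypoints with their predicted depths, solve a least-squares affine map onto the target 2-D keypoints (here via ridge-regularized normal equations, which remain differentiable in PyTorch), and measure the reprojection error. Shapes and the regularization constant are illustrative, not the paper's exact code.

import torch

def affine_reprojection_loss(src_kpts2d, pred_depth, tgt_kpts2d, ridge=1e-4):
    """DepthNets-style differentiable affine solve (illustrative sketch).
    src_kpts2d: (B, K, 2)  pred_depth: (B, K, 1)  tgt_kpts2d: (B, K, 2)"""
    B, K, _ = src_kpts2d.shape
    ones = torch.ones(B, K, 1, device=src_kpts2d.device)
    # Homogeneous 3-D source points [x, y, depth, 1].
    A = torch.cat([src_kpts2d, pred_depth, ones], dim=-1)   # (B, K, 4)
    # Least-squares affine parameters via regularized normal equations;
    # torch.linalg.solve is batched and differentiable.
    AtA = A.transpose(1, 2) @ A + ridge * torch.eye(4, device=A.device)
    Atb = A.transpose(1, 2) @ tgt_kpts2d                    # (B, 4, 2)
    m = torch.linalg.solve(AtA, Atb)                        # (B, 4, 2) affine map
    proj = A @ m                                            # (B, K, 2) reprojection
    return ((proj - tgt_kpts2d) ** 2).mean()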