Collaborating Authors

 Hwang, Kyumin


Intrinsic Image Decomposition for Robust Self-supervised Monocular Depth Estimation on Reflective Surfaces

arXiv.org Artificial Intelligence

Self-supervised monocular depth estimation (SSMDE) has gained attention in the field of deep learning as it estimates depth without requiring ground truth depth maps. This approach typically uses a photometric consistency loss between a synthesized image, generated from the estimated depth, and the original image, thereby reducing the need for extensive dataset acquisition. However, the conventional photometric consistency loss relies on the Lambertian assumption, which often leads to significant errors when dealing with reflective surfaces that deviate from this model. To address this limitation, we propose a novel framework that incorporates intrinsic image decomposition into SSMDE. Our method synergistically trains for both monocular depth estimation and intrinsic image decomposition. The accurate depth estimation facilitates multi-image consistency for intrinsic image decomposition by aligning different view coordinate systems, while the decomposition process identifies reflective areas and excludes corrupted gradients from the depth training process. Furthermore, our framework introduces a pseudo-depth generation and knowledge distillation technique to further enhance the performance of the student model across both reflective and non-reflective surfaces. Comprehensive evaluations on multiple datasets show that our approach significantly outperforms existing SSMDE baselines in depth prediction, especially on reflective surfaces.
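The idea of excluding reflective pixels from photometric supervision can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `masked_photometric_loss`, the plain L1 error (SSMDE pipelines commonly combine L1 with a structural similarity term), and the hand-made boolean mask are all assumptions made for illustration. In the paper the reflective regions come from the learned intrinsic decomposition, which is not modeled here.

```python
import numpy as np

def masked_photometric_loss(target, synthesized, reflective_mask):
    """Mean L1 photometric error over pixels not flagged as reflective.

    target, synthesized: float arrays of shape (H, W, 3) with values in [0, 1].
    reflective_mask: bool array of shape (H, W); True marks pixels to exclude.
    (Hypothetical helper for illustration only.)
    """
    # Per-pixel L1 error, averaged over the colour channels -> shape (H, W).
    per_pixel = np.abs(target - synthesized).mean(axis=-1)
    valid = ~reflective_mask
    if not valid.any():
        # Every pixel is excluded, so there is nothing to supervise.
        return 0.0
    return float(per_pixel[valid].mean())

# Toy 2x2 example: one pixel has a large error, as a specular highlight might.
target = np.zeros((2, 2, 3))
synthesized = np.zeros((2, 2, 3))
synthesized[0, 0] = 0.9   # large mismatch on the "reflective" pixel
synthesized[1, 1] = 0.1   # small mismatch elsewhere

no_mask = np.zeros((2, 2), dtype=bool)
highlight_mask = no_mask.copy()
highlight_mask[0, 0] = True

loss_all = masked_photometric_loss(target, synthesized, no_mask)
loss_masked = masked_photometric_loss(target, synthesized, highlight_mask)
print(loss_all, loss_masked)
```

With the highlight masked out, the loss is driven only by the remaining pixels, so the large view-dependent mismatch no longer dominates the training signal.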


Self-supervised Monocular Depth Estimation Robust to Reflective Surface Leveraged by Triplet Mining

arXiv.org Artificial Intelligence

Published as a conference paper at ICLR 2025. Wonhyeok Choi 1, Kyumin Hwang 1, Wei Peng 2, Minwoo Choi 1, Sunghoon Im 1. Electrical Engineering and Computer Science 1, Psychiatry and Behavioral Sciences 2; Daegu Gyeongbuk Institute of Science and Technology 1, Stanford University 2; South Korea 1, USA 2. {smu06117,kyumin,subminu,sunghoonim}@dgist.ac.kr 1, wepeng@stanford.edu 2
Abstract: Self-supervised monocular depth estimation (SSMDE) aims to predict the dense depth map of a monocular image by learning depth from RGB image sequences, eliminating the need for ground-truth depth labels. Although this approach simplifies data acquisition compared to supervised methods, it struggles with reflective surfaces, which violate the assumption of Lambertian reflectance and therefore lead to inaccurate training on such surfaces. To tackle this problem, we propose a novel training strategy for SSMDE that leverages triplet mining to pinpoint reflective regions at the pixel level, guided by the camera geometry between different viewpoints. The proposed reflection-aware triplet mining loss specifically penalizes inappropriate photometric error minimization on the localized reflective regions while preserving depth accuracy in non-reflective areas. We also incorporate a reflection-aware knowledge distillation method that enables a student model to selectively learn pixel-level knowledge from reflective and non-reflective regions. Evaluation results on multiple datasets demonstrate that our method effectively enhances depth quality on reflective surfaces and outperforms state-of-the-art SSMDE baselines.
SSMDE significantly simplifies data acquisition compared to traditional supervised methods (Fu et al., 2018; Lee et al., 2019; Bhat et al., 2021), which often involve high annotation costs. As such, many SSMDE studies (Godard et al., 2019; Zhou et al., 2017; Garg et al., 2016; Guizilini et al., 2020) have explored its viability as a mainstay for applications such as autonomous driving, highlighting its potential in outdoor environments. Despite these advantages, SSMDE approaches typically struggle to estimate depth accurately on non-Lambertian surfaces such as mirrors, transparent objects, and specular surfaces. This difficulty primarily arises from the assumption of Lambertian reflectance (Basri & Jacobs, 2003) embedded in most SSMDE methods.
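Why the Lambertian assumption matters can be seen in a small numerical sketch. It uses the textbook Phong specular term as a stand-in for a reflective surface; the paper does not say it uses this model. The scene values (`albedo`, `k_s`, `shininess`, the light and view directions) are arbitrary choices for illustration. A diffuse surface point looks the same from both cameras, while a specular one changes brightness with the viewpoint, so a photometric loss reports an error even when the depth is correct.

```python
import numpy as np

def normalize(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def lambertian(albedo, normal, light):
    """Diffuse radiance: depends on the normal and light, not on the viewer."""
    return albedo * max(0.0, float(np.dot(normal, light)))

def phong_specular(k_s, shininess, normal, light, view):
    """Phong specular term: depends on the viewing direction."""
    # Mirror reflection of the light direction about the surface normal.
    reflected = 2.0 * np.dot(normal, light) * normal - light
    return k_s * max(0.0, float(np.dot(reflected, view))) ** shininess

# Illustrative scene (all values are assumptions).
normal = normalize([0, 0, 1])
light = normalize([1, 0, 1])
view_a = normalize([-1, 0, 1])  # camera A sits on the mirror direction
view_b = normalize([0, 0, 1])   # camera B looks straight down the normal

diffuse = lambertian(0.8, normal, light)           # identical for both cameras
spec_a = phong_specular(0.5, 10, normal, light, view_a)
spec_b = phong_specular(0.5, 10, normal, light, view_b)

# Observed intensities of the same surface point from the two cameras.
intensity_a = diffuse + spec_a
intensity_b = diffuse + spec_b
print(abs(intensity_a - intensity_b))  # non-zero photometric error at correct depth
```

The residual printed at the end comes entirely from the specular term, which is the kind of error a depth network may try to explain away by distorting depth on reflective regions.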


A Study on the Generality of Neural Network Structures for Monocular Depth Estimation

arXiv.org Artificial Intelligence

Monocular depth estimation has been widely studied, and significant improvements in performance have been recently reported. However, most previous works are evaluated on only a few benchmark datasets, such as KITTI, and none provides an in-depth analysis of the generalization performance of monocular depth estimation. In this paper, we investigate in depth how various backbone networks (e.g., CNN and Transformer models) affect the generalization of monocular depth estimation. First, we evaluate state-of-the-art models on both in-distribution datasets and out-of-distribution datasets, the latter never seen during network training. Then, we investigate the internal properties of the representations from the intermediate layers of CNN- and Transformer-based models using synthetic texture-shifted datasets. Through extensive experiments, we observe that Transformers exhibit a strong shape bias, whereas CNNs exhibit a strong texture bias. We also find that texture-biased models generalize worse for monocular depth estimation than shape-biased models. We demonstrate that similar trends are observed in real-world driving datasets captured under diverse environments. Lastly, we conduct a dense ablation study with various backbone networks that are utilized in modern strategies. The experiments demonstrate that the intrinsic locality of CNNs and the self-attention of Transformers induce texture bias and shape bias, respectively.