M3Depth: Wavelet-Enhanced Depth Estimation on Mars via Mutual Boosting of Dual-Modal Data

Li, Junjie, Wang, Jiawei, Li, Miyu, Liu, Yu, Wang, Yumei, Xu, Haitao

arXiv.org Artificial Intelligence 

Abstract: Depth estimation has great potential for obstacle avoidance and navigation in future Mars exploration missions. Compared to traditional stereo matching, learning-based stereo depth estimation provides a data-driven approach to infer dense and precise depth maps from stereo image pairs. However, these methods often suffer performance degradation in environments with sparse textures and few geometric constraints, such as the unstructured terrain of Mars. To address this, we propose M3Depth, a depth estimation model tailored for Mars rovers. Because Martian terrain has a sparse, smooth texture composed primarily of low-frequency features, our model incorporates a convolutional kernel based on the wavelet transform that effectively captures the low-frequency response and expands the receptive field. Additionally, we introduce a consistency loss that explicitly models the complementary relationship between the depth map and the surface normal map, using the surface normal as a geometric constraint to improve the accuracy of depth estimation. Finally, a pixel-wise refinement module with a mutual boosting mechanism iteratively refines both the depth and surface normal predictions. M3Depth achieves a 16% improvement in depth estimation accuracy compared to other state-of-the-art depth estimation methods. Furthermore, the model demonstrates strong applicability in real-world Martian scenarios, offering a promising solution for future Mars exploration missions.

Limited scene perception capabilities have become a critical bottleneck for the traveling speed of current Mars rovers [1], hindering the efficient completion of scientific tasks. For example, the Curiosity rover encounters delays and slowdowns when navigating around obstacles such as rocks, resulting in an average travel distance of only 28.9 meters per sol [2].
Similarly, the Zhurong Rover covers merely 6.2

This work was supported in part by the National Key Research and Development Program of China under Grant 2022YFB2902705, in part by the Beijing University of Posts and Telecommunications (BUPT) Excellent Ph.D. Students Foundation under Grant CX20241090, and in part by the BUPT Innovation and Entrepreneurship Support Program under Grant 2025-YC-T025.

Wang are with the School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: junjie@bupt.edu.cn). J. Wang is with the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: wangjiawei98@bupt.edu.cn). H. Xu is with the National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China (e-mail: xuhaitao@nssc.ac.cn).

Figure 1.

Depth estimation holds great potential for enhancing scene perception. It provides a more comprehensive understanding of the 3D structure [4] than 2D approaches such as terrain categorization [5] and semantic segmentation [6].
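To make the link between a depth map and 3D surface structure concrete, the following is a minimal sketch of how surface normals can be derived from a depth map and compared against separately predicted normals, in the spirit of the depth-normal consistency idea mentioned in the abstract. It is not the paper's loss: the function names `normals_from_depth` and `consistency_loss` are hypothetical, the depth map is treated as a height field on a unit pixel grid (camera intrinsics are ignored), and the penalty is assumed to be one minus cosine similarity.

```python
import numpy as np


def normals_from_depth(depth):
    """Estimate unit surface normals from a depth map via finite differences.

    Assumption: depth is treated as a height field z = f(x, y) sampled on a
    unit pixel grid, so camera intrinsics are ignored in this sketch.
    """
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (cols, x).
    dz_dy, dz_dx = np.gradient(depth)
    # An unnormalized normal of the surface z = f(x, y) is (-dz/dx, -dz/dy, 1).
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)


def consistency_loss(depth, pred_normals):
    """Mean (1 - cosine similarity) between depth-derived and predicted normals.

    Assumption: pred_normals has shape (H, W, 3) and holds unit vectors.
    """
    derived = normals_from_depth(depth)
    cos = np.sum(derived * pred_normals, axis=-1)
    return float(np.mean(1.0 - cos))
```

Under these assumptions, the loss is zero when the predicted normals agree exactly with the normals implied by the depth map, and it grows as the two disagree. That is the sense in which a normal map can act as a geometric constraint on depth.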