DPL: Depth-only Perceptive Humanoid Locomotion via Realistic Depth Synthesis and Cross-Attention Terrain Reconstruction

Sun, Jingkai, Han, Gang, Sun, Pihai, Zhao, Wen, Cao, Jiahang, Wang, Jiaxu, Guo, Yijie, Zhang, Qiang

arXiv.org Artificial Intelligence 

Abstract-- Recent advancements in legged robot perceptive locomotion have shown promising progress. However, terrain-aware humanoid locomotion remains largely constrained to two paradigms: depth image-based end-to-end learning and elevation map-based methods. The former suffers from limited training efficiency and a significant sim-to-real gap in depth perception, while the latter depends heavily on multiple vision sensors and localization systems, resulting in latency and reduced robustness. To overcome these challenges, we propose a novel framework that tightly integrates three key components: (1) a Terrain-Aware Locomotion Policy with a Blind Backbone, which leverages pre-trained elevation map-based perception to guide reinforcement learning with minimal visual input; (2) a Multi-Modality Cross-Attention Transformer, which reconstructs structured terrain representations from noisy depth images; and (3) a Realistic Depth Image Synthesis Method, which employs self-occlusion-aware ray casting and noise-aware modeling to synthesize realistic depth observations, achieving over a 30% reduction in terrain reconstruction error. This combination enables efficient policy training with limited data and hardware resources while preserving critical terrain features essential for generalization.

Humanoid robots offer immense potential for enabling autonomous mobility in human-centric, unstructured environments. Achieving this vision requires the development of perceptive locomotion systems that integrate visual perception and control, enabling real-time gait adaptation to complex terrain.
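The noise-aware modeling mentioned in component (3) can be illustrated with a minimal sketch. The snippet below corrupts a clean simulated depth image with sensor-like artifacts: depth-proportional Gaussian noise, extra jitter near depth discontinuities (where self-occlusion boundaries degrade real sensors), and random invalid "hole" pixels encoded as zero. The function name, parameter values, and specific noise terms are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def synthesize_realistic_depth(depth, rng=None, gaussian_std=0.01,
                               edge_noise_std=0.05, hole_prob=0.02,
                               max_range=4.0):
    """Hypothetical noise-aware depth corruption for sim-to-real training.

    All noise terms are illustrative stand-ins for the paper's
    noise-aware modeling stage.
    """
    rng = np.random.default_rng(rng)
    out = depth.astype(np.float32).copy()

    # Depth-proportional Gaussian noise: error grows with distance.
    out += rng.normal(0.0, gaussian_std, out.shape) * out

    # Extra jitter near depth edges (self-occlusion boundaries).
    gy, gx = np.gradient(depth.astype(np.float32))
    edge_mask = np.hypot(gy, gx) > 0.05
    out[edge_mask] += rng.normal(0.0, edge_noise_std, int(edge_mask.sum()))

    # Random invalid pixels ("holes"), encoded as 0 like many real sensors.
    holes = rng.random(out.shape) < hole_prob
    out[holes] = 0.0

    return np.clip(out, 0.0, max_range)
```

In a training loop, such a function would be applied on the fly to each rendered depth frame so the policy and terrain reconstructor never see idealized simulator depth.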