CVD-SfM: A Cross-View Deep Front-end Structure-from-Motion System for Sparse Localization in Multi-Altitude Scenes

Yaxuan Li, Yewei Huang, Bijay Gaudel, Hamidreza Jafarnejadsani, Brendan Englot

arXiv.org Artificial Intelligence 

We present a novel multi-altitude camera pose estimation system that addresses the challenge of robust and accurate localization across varied altitudes from sparse image input. The system handles diverse environmental conditions and viewpoint variations by integrating a cross-view transformer, deep features, and structure-from-motion into a unified framework. To benchmark our method and foster further research, we introduce two newly collected datasets specifically tailored for multi-altitude camera pose estimation; datasets of this nature remain rare in the current literature. Extensive comparative analyses on these datasets show that our system achieves superior accuracy and robustness in multi-altitude sparse pose estimation compared to existing solutions, making it well suited for real-world robotic applications such as aerial navigation, search and rescue, and automated inspection.

I. INTRODUCTION

Structure-from-motion (SfM) [1], [2], [3] has received extensive attention in computer vision and robotics; it is pivotal in various real-world applications, including autonomous navigation [4], rapid mapping after natural disasters for situational awareness, detailed preservation of historical landmarks [5], and immersive virtual reality (VR) experiences. It remains a cornerstone of pose estimation and reconstruction, and is particularly effective when input is limited to sparse images. While conventional SfM approaches deliver impressive results when image overlap is abundant, they often struggle with sparse input captured at vastly different altitudes. In these scenarios, the drastic viewpoint differences limit shared visual features, making it difficult to establish reliable correspondences.