Towards Sharper Object Boundaries in Self-Supervised Depth Estimation

Aurélien Cecille, Stefan Duffner, Franck Davoine, Rémi Agier, Thibault Neveu

arXiv.org Artificial Intelligence 

Monocular depth estimation is a fundamental problem in computer vision with applications in autonomous driving, robotics and augmented reality. Recently, self-supervised learning methods have achieved impressive results by using view synthesis as a supervisory signal. Despite these advances, handling depth discontinuities remains challenging. In most scenes, foreground objects occlude the background, creating depth discontinuities at object boundaries. Conventional models assign a single depth value per pixel, and uncertainty at edges often causes that value to fall between the foreground and background depths, blurring transitions and introducing artifacts in the point cloud (see Figure 2). To address this, we propose to represent per-pixel depth as a multimodal distribution that explicitly models both depths at boundaries, preserving sharp transitions and removing artifacts.
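The averaging effect described above can be illustrated with a small numerical sketch. It is not the method of the paper: it assumes, purely for illustration, that the depth at a boundary pixel is a two-component mixture with hypothetical weights and depths, and it compares the mixture mean (which a single-valued estimate tends toward under uncertainty) with the depth of the most probable mode. The function names `expected_depth` and `dominant_mode_depth` are likewise illustrative.

```python
# Illustrative sketch only: a boundary pixel modeled as a two-component
# mixture over depth. Weights and depths are hypothetical values, not
# taken from the paper.

def expected_depth(weights, depths):
    """Mixture mean: the value a single-depth estimate is pulled toward."""
    total = sum(weights)
    return sum(w * d for w, d in zip(weights, depths)) / total


def dominant_mode_depth(weights, depths):
    """Depth of the most probable component, ignoring the other mode."""
    best_index = max(range(len(weights)), key=lambda i: weights[i])
    return depths[best_index]


# Assumed boundary pixel: foreground at 2 m (weight 0.6),
# background at 10 m (weight 0.4).
boundary_weights = [0.6, 0.4]
boundary_depths = [2.0, 10.0]

blurred = expected_depth(boundary_weights, boundary_depths)
sharp = dominant_mode_depth(boundary_weights, boundary_depths)

print(f"single-value (mean) depth: {blurred:.1f} m")
print(f"dominant-mode depth:       {sharp:.1f} m")
```

Under these assumed values the mean lies between the two surfaces, at a depth where no surface exists, which corresponds to the stray points seen in the point cloud. Selecting a mode keeps the estimate on one of the two surfaces.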