Monocular Depth in the Real World


Understanding scenes in 3D is both a fundamental challenge in computer vision and a breakthrough capability in practice. Learning to infer depth from cameras alone can help save lives, increase mobility, reduce costs, and improve manufacturing processes, among many other applications. In this follow-up post, we go one big step further and ask how to bring monocular depth estimation (monodepth) to the real world. Our core insight is to jointly learn, end to end, multiple geometrically related prediction tasks, such as camera models, depth, and per-pixel motion in 2D (a.k.a. optical flow). Furthermore, we go beyond relaxing geometric constraints and show how breakthroughs in neural architectures, including the famous transformers [7], enable generalization to the multi-camera and multi-frame setups that are the norm in practice. Most works in self-supervised monocular depth estimation focus on only two of the three components required to use geometry as an inductive bias for training: depth and ego-motion (a.k.a. pose).
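To make the role of these components concrete, here is a minimal sketch of how depth, ego-motion, and a camera model combine to map a pixel from one frame into another. It assumes the third component is the camera model (the paragraph lists it among the jointly learned tasks) and uses a simple pinhole model with known intrinsics, which is an illustrative assumption rather than the method described in this post. All function and parameter names are hypothetical. In self-supervised setups of this kind, comparing the image at the original pixel with the image at the reprojected location is typically what provides the training signal.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), assumed pinhole


def unproject(u: float, v: float, depth: float, k: Intrinsics) -> Vec3:
    """Lift pixel (u, v) with a predicted depth to a 3D point in the camera frame."""
    fx, fy, cx, cy = k
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def transform(p: Vec3, rotation: Sequence[Sequence[float]], translation: Vec3) -> Vec3:
    """Apply the ego-motion (pose) R @ p + t to move the point into the other frame."""
    moved = [
        sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    return (moved[0], moved[1], moved[2])


def project(p: Vec3, k: Intrinsics) -> Tuple[float, float]:
    """Project a 3D point back to pixel coordinates with the same pinhole model."""
    fx, fy, cx, cy = k
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)


def reproject_pixel(
    u: float,
    v: float,
    depth: float,
    k: Intrinsics,
    rotation: Sequence[Sequence[float]],
    translation: Vec3,
) -> Tuple[float, float]:
    """Chain the three components: camera model, depth, and ego-motion."""
    return project(transform(unproject(u, v, depth, k), rotation, translation), k)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K_EXAMPLE: Intrinsics = (500.0, 500.0, 320.0, 240.0)  # made-up values for illustration

# A point 10 units ahead at the image center, with the camera shifted by 0.5 along x.
shifted = reproject_pixel(320.0, 240.0, 10.0, K_EXAMPLE, IDENTITY, (0.5, 0.0, 0.0))
```

With these made-up values, the pixel shifts horizontally by `fx * 0.5 / 10 = 25` pixels, and the shift shrinks as depth grows. This coupling between depth, pose, and intrinsics is what lets geometry constrain training: an error in any one of the three moves the reprojected pixel.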