Goto

Collaborating Authors

 harmonizing overcomplete local network prediction


Depth from a Single Image by Harmonizing Overcomplete Local Network Predictions

Neural Information Processing Systems

A single color image can contain many cues informative towards different aspects of local geometric structure. We approach the problem of monocular depth estimation by using a neural network to produce a mid-level representation that summarizes these cues. This network is trained to characterize local scene geometry by predicting, at every image location, depth derivatives of different orders, orientations and scales. However, instead of a single estimate for each derivative, the network outputs probability distributions that allow it to express confidence about some coefficients, and ambiguity about others. Scene depth is then estimated by harmonizing this overcomplete set of network predictions, using a globalization procedure that finds a single consistent depth map that best matches all the local derivative distributions. We demonstrate the efficacy of this approach through evaluation on the NYU v2 depth data set.


Reviews: Depth from a Single Image by Harmonizing Overcomplete Local Network Predictions

Neural Information Processing Systems

While the paper is well structured and easy to follow until Section 3.1, there are some open questions on the technical side and the motivation behind the proposed steps. Why is a specific intermediate GMM representation needed instead of letting a neural network do what it is good for: learning good intermediate representations? Especially fixing the derivative filters of the first network seems an unnecessary restriction. The approach has also some similarity to using VLAD/FisherVector features as inputs or the more recently proposed NetVLAD neural network architecture. So what is the difference of the presented approach to the ones mentioned?


Depth from a Single Image by Harmonizing Overcomplete Local Network Predictions

Neural Information Processing Systems

A single color image can contain many cues informative towards different aspects of local geometric structure. We approach the problem of monocular depth estimation by using a neural network to produce a mid-level representation that summarizes these cues. This network is trained to characterize local scene geometry by predicting, at every image location, depth derivatives of different orders, orientations and scales. However, instead of a single estimate for each derivative, the network outputs probability distributions that allow it to express confidence about some coefficients, and ambiguity about others. Scene depth is then estimated by harmonizing this overcomplete set of network predictions, using a globalization procedure that finds a single consistent depth map that best matches all the local derivative distributions.