stereo image pair
Export Reviews, Discussions, Author Feedback and Meta-Reviews
This paper addresses the problem of generating 3D object proposals given a stereo image pair from an autonomous driving vehicle. The paper proposes a set of features for a 3D cuboid over a point cloud and ground plane derived from the stereo image pair. The features include point cloud density, free space, object height prior, and object height relative to its surroundings. Note that the features depend on knowledge of the object class (other "objectness" proposal methods are agnostic to the object class). A structural SVM is trained to predict the "objectness" of the 3D cuboid proposal.
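As a rough illustration of the point cloud density feature mentioned in the review, the sketch below computes the fraction of voxels inside a cuboid that contain at least one 3D point. This is one plausible reading of "density", not necessarily the paper's definition. The function name `cuboid_density`, the axis-aligned box, and the `voxel_size` parameter are assumptions made for this example.

```python
import numpy as np

def cuboid_density(points, box_min, box_max, voxel_size):
    """Fraction of voxels in an axis-aligned cuboid that hold at least one point.

    points: (N, 3) array of 3D points (hypothetical input layout).
    box_min, box_max: opposite corners of the cuboid.
    voxel_size: edge length of the cubic voxels (an assumed parameter).
    """
    points = np.asarray(points, dtype=float)
    box_min = np.asarray(box_min, dtype=float)
    box_max = np.asarray(box_max, dtype=float)
    # Number of voxels along each axis (at least one per axis).
    dims = np.maximum(np.ceil((box_max - box_min) / voxel_size).astype(int), 1)
    # Keep only the points that fall inside the cuboid.
    inside = np.all((points >= box_min) & (points < box_max), axis=1)
    idx = np.floor((points[inside] - box_min) / voxel_size).astype(int)
    idx = np.minimum(idx, dims - 1)  # guard against floating-point edge cases
    # Count distinct occupied voxels.
    occupied = len({tuple(i) for i in idx})
    return occupied / int(np.prod(dims))
```

A cuboid that tightly encloses an object would be expected to score higher than one covering mostly empty space, which is why such a feature can help rank proposals.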
StereoDiffusion: Training-Free Stereo Image Generation Using Latent Diffusion Models
Wang, Lezhong, Frisvad, Jeppe Revall, Jensen, Mark Bo, Bigdeli, Siavash Arjomand
The demand for stereo images increases as manufacturers launch more XR devices. To meet this demand, we introduce StereoDiffusion, a method that, unlike traditional inpainting pipelines, is training-free, remarkably straightforward to use, and integrates seamlessly into the original Stable Diffusion model. Our method modifies the latent variable to provide an end-to-end, lightweight capability for fast generation of stereo image pairs, without the need for fine-tuning model weights or any post-processing of images. Using the original input to generate a left image and estimate a disparity map for it, we generate the latent vector for the right image through Stereo Pixel Shift operations, complemented by Symmetric Pixel Shift Masking Denoise and Self-Attention Layers Modification methods to align the right-side image with the left-side image. Moreover, our proposed method maintains a high standard of image quality throughout the stereo generation process, achieving state-of-the-art scores in various quantitative evaluations.
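The core idea of shifting pixels by disparity can be sketched with a generic forward warp. This is not the authors' implementation: the function name `shift_by_disparity`, the rounding to whole pixels, and the convention that a point at column x in the left view appears at column x - d in the right view are assumptions for illustration. The same operation could in principle be applied to a latent feature map instead of an image.

```python
import numpy as np

def shift_by_disparity(left, disparity):
    """Forward-warp a left view into a right view using per-pixel disparity.

    left: (H, W) or (H, W, C) array (an image or a latent feature map).
    disparity: (H, W) non-negative horizontal shifts in pixels.
    Returns (right, hole_mask); hole_mask is True where no source pixel landed.
    """
    h, w = disparity.shape
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    # Largest disparity written so far at each target pixel; -1 means empty.
    best = np.full((h, w), -1.0)
    shifts = np.rint(disparity).astype(int)
    for y in range(h):
        for x in range(w):
            tx = x - shifts[y, x]  # assumed convention: x_right = x_left - d
            if 0 <= tx < w and disparity[y, x] > best[y, tx]:
                # Nearer surfaces (larger disparity) win occlusion conflicts.
                right[y, tx] = left[y, x]
                best[y, tx] = disparity[y, x]
                filled[y, tx] = True
    return right, ~filled
```

The returned hole mask marks disoccluded regions that the warp cannot fill. In a generation pipeline, such regions are the ones that would need to be synthesized, for example during denoising.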
Learning to Render Novel Views from Wide-Baseline Stereo Pairs
Du, Yilun, Smith, Cameron, Tewari, Ayush, Sitzmann, Vincent
We introduce a method for novel view synthesis given only a single wide-baseline stereo image pair. In this challenging regime, 3D scene points are regularly observed only once, requiring prior-based reconstruction of scene geometry and appearance. We find that existing approaches to novel view synthesis from sparse observations fail due to recovering incorrect 3D geometry and due to the high cost of differentiable rendering that precludes their scaling to large-scale training. We take a step towards resolving these shortcomings by formulating a multi-view transformer encoder, proposing an efficient, image-space epipolar line sampling scheme to assemble image features for a target ray, and a lightweight cross-attention-based renderer. Our contributions enable training of our method on a large-scale real-world dataset of indoor and outdoor scenes. We demonstrate that our method learns powerful multi-view geometry priors while reducing the rendering time. We conduct extensive comparisons on held-out test scenes across two real-world datasets, significantly outperforming prior work on novel view synthesis from sparse image observations and achieving multi-view-consistent novel view synthesis.
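To make the idea of image-space epipolar line sampling concrete, the sketch below projects the near and far ends of a target ray into a source view and then spaces samples uniformly in pixel coordinates between them. It is a minimal illustration under stated assumptions, not the paper's implementation: the function name `epipolar_samples`, the shared intrinsics `K`, the finite depth bounds `near` and `far`, and the requirement that both endpoints lie in front of the source camera are all assumptions.

```python
import numpy as np

def epipolar_samples(pixel, K, R, t, near, far, n):
    """Sample n source-view pixel locations along the epipolar segment of a target ray.

    pixel: (u, v) in the target image.
    K: (3, 3) intrinsics, assumed shared by both views.
    R, t: rotation (3, 3) and translation (3,) mapping target-camera
          coordinates to source-camera coordinates.
    near, far: positive depth bounds along the target ray (assumed known).
    """
    # Back-project the pixel to a ray direction with unit depth.
    ray = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])

    def project(depth):
        # Move the 3D point into the source frame, then apply the pinhole model.
        p = K @ (R @ (depth * ray) + t)
        return p[:2] / p[2]

    start, end = project(near), project(far)
    # Uniform spacing in image space, not in depth.
    weights = np.linspace(0.0, 1.0, n)[:, None]
    return (1.0 - weights) * start + weights * end
```

Features for the target ray could then be gathered by interpolating the source feature map at the returned locations. Sampling in pixel space keeps the samples evenly spread over the source image, whereas uniform depth sampling would bunch them toward the far end of the segment.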
Unsupervised learning of depth and motion
Konda, Kishore, Memisevic, Roland
We present a model for the joint estimation of disparity and motion. The model is based on learning about the interrelations between images from multiple cameras, multiple frames in a video, or the combination of both. We show that learning depth and motion cues, as well as their combinations, from data is possible within a single type of architecture and a single type of learning algorithm, by using biologically inspired "complex cell" like units, which encode correlations between the pixels across image pairs. Our experimental results show that the learning of depth and motion makes it possible to achieve state-of-the-art performance in 3-D activity analysis, and to outperform existing hand-engineered 3-D motion features by a very large margin.