We infer and generate three-dimensional (3D) scene information from a single input image, without supervision. This problem is under-explored: most prior work relies on supervision from, e.g., 3D ground truth, multiple images of a scene, image silhouettes, or keypoints. We propose Pix2Shape, an approach that solves this problem with four components: (i) an encoder that infers a latent 3D representation from an image, (ii) a decoder that generates an explicit 2.5D surfel-based reconstruction of the scene from the latent code, (iii) a differentiable renderer that synthesizes a 2D image from the surfel representation, and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and those from the training distribution. Pix2Shape can generate complex 3D scenes that scale with the view-dependent on-screen resolution, unlike representations that capture world-space resolution, i.e., voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space, and that the decoder can then be applied to this latent representation to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset, as well as on a novel benchmark we developed, called 3D-IQTT, which evaluates models on their ability to perform 3D spatial reasoning. Qualitative and quantitative evaluations demonstrate Pix2Shape's ability to solve scene reconstruction, generation, and understanding tasks.
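The key property of the surfel representation above is that it is view-dependent: one surfel per rendered pixel, so the representation grows with on-screen resolution rather than with world-space resolution. A minimal sketch of such a structure, assuming per-pixel depth and normal fields (the field names and layout are our illustrative assumptions, not Pix2Shape's actual implementation):

```python
import numpy as np

# One surfel per pixel: the number of surfels scales with the rendered
# resolution, unlike voxel grids or meshes, which scale with world-space
# resolution. Fields here are illustrative placeholders.
def make_surfel_map(height, width):
    return {
        "depth": np.zeros((height, width)),       # per-pixel depth (2.5D)
        "normal": np.zeros((height, width, 3)),   # per-surfel orientation
    }

low = make_surfel_map(64, 64)      # 4,096 surfels
high = make_surfel_map(256, 256)   # 65,536 surfels at higher resolution
```

Doubling the rendered resolution quadruples the number of surfels, with no change to the underlying latent code.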
We propose a novel graph convolutional neural network that constructs a coarse, sparse latent point cloud from a dense, raw point cloud. Using a novel non-isotropic convolution operation defined on irregular geometries, the model can then reconstruct the original point cloud from this latent cloud with fine detail. Furthermore, we show that it is possible to perform particle simulation directly on the latent cloud encoded from a simulated particle cloud (e.g., fluids), thereby accelerating the simulation process. We test our model on the ShapeNetCore dataset for auto-encoding with a limited latent dimension, and on a synthetic dataset for fluid simulation. We also compare the model with other state-of-the-art models and provide several visualizations to build intuition about its behavior.
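To make the dense-to-sparse step concrete, here is a classical stand-in for constructing a sparse latent cloud: greedy farthest point sampling, which picks well-spread representative points. This is only an illustrative baseline; the paper's encoder is learned, not a fixed sampling rule.

```python
import numpy as np

def farthest_point_sample(points, k):
    """Greedily pick k well-spread points as a sparse summary cloud.

    Illustrative stand-in for a learned dense-to-sparse encoder.
    """
    n = points.shape[0]
    chosen = [0]                      # seed with an arbitrary first point
    dists = np.full(n, np.inf)
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen point.
        dists = np.minimum(dists, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(np.argmax(dists)))   # take the farthest point
    return points[chosen]

cloud = np.random.default_rng(1).normal(size=(2048, 3))  # dense raw cloud
latent = farthest_point_sample(cloud, 64)                # 64-point sparse summary
```

A learned encoder additionally attaches feature vectors to each latent point so that fine detail can be reconstructed, which plain subsampling cannot do.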
We propose a method to learn object representations from 3D point clouds using bundles of geometrically interpretable hidden units, which we call "geometric capsules". Each geometric capsule represents a visual entity, such as an object or a part, and consists of two components: a pose and a feature. The pose encodes "where" the entity is, while the feature encodes "what" it is. We use these capsules to construct a Geometric Capsule Autoencoder that learns to group 3D points into parts (small local surfaces), and these parts into the whole object, in an unsupervised manner. Our novel Multi-View Agreement voting mechanism is used to discover an object's canonical pose and its pose-invariant feature vector. Using the ShapeNet and ModelNet40 datasets, we analyze the properties of the learned representations and show the benefits of having multiple votes agree. We perform alignment and retrieval of arbitrarily rotated objects - tasks that evaluate our model's object identification and canonical pose recovery capabilities - and obtain insightful results.

1 Introduction

Capsule networks structure hidden units into groups, called capsules. Each capsule represents a single visual entity, and its hidden units collectively encode all the information regarding that entity in one place. For example, the length of the hidden unit vector can represent the existence of the entity, and its direction can represent the entity's instantiation parameters ("pose"). Capsule networks combine the expressive power of distributed representations (used within each capsule) with the interpretability of having one computational entity per real-world entity. These capsules can be organized in a hierarchical manner to encode a visual scene. Low-level capsules can represent low-level visual entities (such as edges or object parts), while high-level capsules may represent entire objects.
A routing algorithm [16, 8] is used to discover the connections between the low-level and high-level capsules. This makes it easy to introduce priors, such as "one part can only belong to one object", by enforcing mutual exclusivity.

Figure 1: Model Overview.
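The pose-and-feature pairing described above, and the idea of multiple views voting on a canonical pose, can be sketched as follows. This is a deliberate simplification under stated assumptions: we reduce pose to a 3-D translation (the actual model also encodes rotation), and we use vote variance as a stand-in agreement measure.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GeometricCapsule:
    pose: np.ndarray      # "where" the entity is (here: 3-D translation only)
    feature: np.ndarray   # pose-invariant descriptor of "what" it is

def aggregate_votes(votes):
    """Combine per-view pose votes; a tight cluster signals agreement."""
    consensus = votes.mean(axis=0)
    spread = float(np.var(votes - consensus))  # lower spread = stronger agreement
    return consensus, spread

# Three views each vote on the object's canonical translation.
votes = np.array([[0.10, 0.00, 0.20],
                  [0.12, -0.01, 0.19],
                  [0.09, 0.02, 0.21]])
consensus, spread = aggregate_votes(votes)
cap = GeometricCapsule(pose=consensus, feature=np.zeros(64))
```

In the full model, the votes come from independently transformed views of the same points, so a low spread indicates that the recovered canonical pose is consistent across views.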
Point clouds are often sparse and incomplete, which imposes difficulties for real-world applications such as 3D object classification, detection, and segmentation. Existing shape completion methods tend to generate coarse object shapes without fine-grained details. Moreover, current approaches require fully complete ground truth, which is difficult to obtain in real-world applications. In view of this, we propose a self-supervised object completion method that optimizes the training procedure solely on the partial input, without using fully complete ground truth. To generate high-quality objects with detailed geometric structures, we propose a cascaded refinement network (CRN) with a coarse-to-fine strategy to synthesize complete objects. By considering the local details of the partial input together with adversarial training, we are able to learn the complicated distributions of point clouds and generate object details that are as realistic as possible. We verify our self-supervised method in both unsupervised and supervised experimental settings and show superior performance. Quantitative and qualitative experiments on different datasets demonstrate that our method achieves more realistic outputs than existing state-of-the-art approaches on the 3D point cloud completion task. A large number of works have been proposed for point cloud analysis that directly extract pointwise features from point coordinates.
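The coarse-to-fine cascade can be illustrated with a toy sketch: each stage upsamples the point set by spawning children with small offsets. In the actual CRN the offsets are predicted by a learned network conditioned on the partial input; here they are random noise purely to show the cascade structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def refine_stage(points, factor=2, noise=0.01):
    """One coarse-to-fine stage: each point spawns `factor` perturbed children.

    Toy version; a real refinement network predicts the offsets.
    """
    children = np.repeat(points, factor, axis=0)
    return children + rng.normal(scale=noise, size=children.shape)

coarse = rng.normal(size=(256, 3))           # coarse completed shape
dense = refine_stage(refine_stage(coarse))   # two cascaded stages: 256 -> 512 -> 1024
```

Stacking stages lets early layers fix global structure while later layers add local detail, which is the motivation for the coarse-to-fine strategy.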
We present DRACO, a method for Dense Reconstruction And Canonicalization of Object shape from one or more RGB images. Canonical shape reconstruction, estimating 3D object shape in a coordinate space canonicalized for scale, rotation, and translation parameters, is an emerging paradigm that holds promise for a multitude of robotic applications. Prior approaches either rely on painstakingly gathered dense 3D supervision, or produce only sparse canonical representations, limiting real-world applicability. DRACO performs dense canonicalization using only weak supervision in the form of camera poses and semantic keypoints at train time. During inference, DRACO predicts dense object-centric depth maps in a canonical coordinate-space, solely using one or more RGB images of an object. Extensive experiments on canonical shape reconstruction and pose estimation show that DRACO is competitive or superior to fully-supervised methods.
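The scale and translation parts of canonicalization have a simple closed form, sketched below: center the shape and scale it into a unit bounding sphere. Rotation canonicalization, which DRACO also performs, requires learned cues and is omitted from this sketch.

```python
import numpy as np

def canonicalize(points):
    """Center a point cloud and scale it to a unit bounding sphere.

    Handles translation and scale only; rotation canonicalization
    needs learned semantic cues and is not shown here.
    """
    centered = points - points.mean(axis=0)
    scale = np.linalg.norm(centered, axis=1).max()
    return centered / scale

pts = np.random.default_rng(2).normal(size=(500, 3)) * 5.0 + 1.0
canon = canonicalize(pts)
```

After this step, shapes of the same category land in a shared coordinate frame up to rotation, which is what makes cross-instance comparison tractable.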