scene model
Magic3D: High-Resolution Text-to-3D Content Creation
Lin, Chen-Hsuan, Gao, Jun, Tang, Luming, Takikawa, Towaki, Zeng, Xiaohui, Huang, Xun, Kreis, Karsten, Fidler, Sanja, Liu, Ming-Yu, Lin, Tsung-Yi
DreamFusion has recently demonstrated the utility of a pre-trained text-to-image diffusion model to optimize Neural Radiance Fields (NeRF), achieving remarkable text-to-3D synthesis results. However, the method has two inherent limitations: (a) extremely slow optimization of NeRF and (b) low-resolution image space supervision on NeRF, leading to low-quality 3D models with a long processing time. In this paper, we address these limitations by utilizing a two-stage optimization framework. First, we obtain a coarse model using a low-resolution diffusion prior and accelerate with a sparse 3D hash grid structure. Using the coarse representation as the initialization, we further optimize a textured 3D mesh model with an efficient differentiable renderer interacting with a high-resolution latent diffusion model. Our method, dubbed Magic3D, can create high-quality 3D mesh models in 40 minutes, which is 2x faster than DreamFusion (reportedly taking 1.5 hours on average), while also achieving higher resolution. User studies show that 61.7% of raters prefer our approach over DreamFusion. Together with the image-conditioned generation capabilities, we provide users with new ways to control 3D synthesis, opening up new avenues to various creative applications.
Text-To-4D Dynamic Scene Generation
Singer, Uriel, Sheynin, Shelly, Polyak, Adam, Ashual, Oron, Makarov, Iurii, Kokkinos, Filippos, Goyal, Naman, Vedaldi, Andrea, Parikh, Devi, Johnson, Justin, Taigman, Yaniv
We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model. The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment. Generative models have seen tremendous recent progress, and can now generate realistic images from natural language prompts (Ramesh et al., 2022; Gafni et al., 2022; Rombach et al., 2022; Saharia et al., 2022; Yu et al., 2022; Sheynin et al., 2022). This success has been extended beyond 2D images both temporally to synthesize videos (Singer et al., 2022; Ho et al., 2022) and spatially to produce 3D shapes (Poole et al., 2022; Lin et al., 2022; Nichol et al., 2022b). However, these two categories of generative models
Object-level 3D Semantic Mapping using a Network of Smart Edge Sensors
Hau, Julian, Bultmann, Simon, Behnke, Sven
Autonomous robots that interact with their environment require a detailed semantic scene model. For this, volumetric semantic maps are frequently used. The scene understanding can further be improved by including object-level information in the map. In this work, we extend a multi-view 3D semantic mapping system consisting of a network of distributed smart edge sensors with object-level information, to enable downstream tasks that need object-level input. Objects are represented in the map via their 3D mesh model or as an object-centric volumetric sub-map that can model arbitrary object geometry when no detailed 3D model is available. We propose a keypoint-based approach that estimates object poses via PnP and refines them via ICP alignment of the 3D object model with the observed point cloud segments. Object instances are tracked to integrate observations over time and to be robust against temporary occlusions. Our method is evaluated on the public Behave dataset, where it shows pose estimation accuracy within a few centimeters, and in real-world experiments with the sensor network in a challenging lab environment, where multiple chairs and a table are tracked through the scene online and in real time, even under high occlusions.
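The ICP refinement step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a planar simplification (yaw plus 2D translation, as for furniture standing on a floor), uses brute-force nearest-neighbor matching, and takes `init_pose` as a stand-in for the pose a PnP step would supply. All function names are hypothetical.

```python
import math

def fit_rigid_2d(src, dst):
    """Closed-form least-squares rotation + translation mapping src onto dst (paired points)."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    s_dot = 0.0
    s_cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy
        bx, by = qx - dx, qy - dy
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation moves the rotated source centroid onto the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, tx, ty

def apply_rigid_2d(pose, pts):
    """Apply pose = (theta, tx, ty) to a list of 2D points."""
    theta, tx, ty = pose
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in pts]

def icp_2d(model, observed, init_pose=(0.0, 0.0, 0.0), iterations=10):
    """Refine the pose of `model` against `observed` points (assumed toy ICP loop)."""
    pose = init_pose
    for _ in range(iterations):
        moved = apply_rigid_2d(pose, model)
        # Match each transformed model point to its nearest observed point.
        matches = [
            min(observed, key=lambda q: (q[0] - m[0]) ** 2 + (q[1] - m[1]) ** 2)
            for m in moved
        ]
        # Re-fit the full pose from the original model points to the matches.
        pose = fit_rigid_2d(model, matches)
    return pose
```

As with any ICP variant, this sketch only converges when the initial pose is close enough that most nearest-neighbor matches are correct, which is why a separate initial estimate is needed.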
Unsupervised Object Learning via Common Fate
Tangemann, Matthias, Schneider, Steffen, von Kügelgen, Julius, Locatello, Francesco, Gehler, Peter, Brox, Thomas, Kümmerer, Matthias, Bethge, Matthias, Schölkopf, Bernhard
In human vision, the Principle of Common Fate of Gestalt Psychology (Wertheimer, 2012) has been shown to play an important role for object learning (Spelke, 1990). It posits that elements that are moving together tend to be perceived as one--a perceptual bias that may have evolved to be able to recognize camouflaged predators (Troscianko et al., 2009). In our work, we show that this principle can be successfully used also for machine vision by using it in a multi-stage object learning approach (Figure 1): First, we use unsupervised motion segmentation to obtain a candidate segmentation of a video frame. Second, we train generative object and background models on this segmentation. While the regions obtained by the motion segmentation are caused by objects moving in 3D, only visible parts can be segmented. To learn the actual objects (i.e., the causes), a crucial task for the object model is learning to generalize beyond the occlusions present in its input data. To measure success, we provide a dataset including object ground truth. As the last stage, we show that the learned object and background models can be combined into a flexible scene model that allows sampling manipulated novel scenes. Thus, in contrast to existing object-centric models trained end-to-end, our work aims at decomposing object learning into evaluable subproblems and testing the potential of exploiting object motions for building scalable object-centric models that allow for causally meaningful interventions in generation.
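The common-fate principle itself (elements that move together belong together) can be shown with a toy sketch. This is only an illustration of the principle, not the unsupervised motion segmentation used in the paper: it assumes each tracked element already has a 2D motion vector and links elements whose vectors differ by at most a hypothetical tolerance `tol`.

```python
def group_by_common_fate(flows, tol=0.5):
    """Group elements with similar motion vectors.

    flows: list of (dx, dy) motion vectors, one per tracked element.
    Returns one integer group label per element, numbered by first appearance.
    """
    n = len(flows)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            ddx = flows[i][0] - flows[j][0]
            ddy = flows[i][1] - flows[j][1]
            if ddx * ddx + ddy * ddy <= tol * tol:
                parent[find(j)] = find(i)  # similar motion: merge groups

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]
```

For example, three elements drifting right at roughly the same speed and two nearly static elements would come out as two groups.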
Aligned Scene Modeling of a Robot's Vista Space — An Evaluation
Swadzba, Agnes (Bielefeld University) | Wachsmuth, Sven (Bielefeld University)
Supporting structures such as tables and cupboards are one kind of meaningful structure in indoor rooms. A robot needs to know these structures to interact naturally with humans and the environment. As bottom-up detection of such structures is a challenging problem, we propose to estimate potential supporting structures from a spatial description like "a bowl on the table". Because language and cognition schematize space in the same way, it is possible to estimate the representation of the space underlying a scene description. To do so, we introduce the aligned modeling approach, which consists of rules transforming a sequence of object relations into a set of trees and a methodology to ground the abstract representation of the scene layout in the current perception, using detectors for small movable objects and an extraction of planar surfaces. An analysis of 30 descriptions shows the robustness of our approach to a variety of description strategies and object detection errors.
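The idea of turning a sequence of object relations into a set of trees can be sketched as follows. The abstract does not give the actual rules, so this is a simplified assumption: each relation is a hypothetical `(object, relation, reference)` triple, the reference object becomes the parent, and objects that are never placed relative to anything become roots (candidate supporting structures). The relations are assumed to contain no cycles.

```python
def relations_to_forest(relations):
    """Build nested dicts (one tree per root) from (object, relation, reference) triples."""
    children = {}
    has_parent = set()
    for obj, _relation, ref in relations:
        children.setdefault(ref, [])
        children.setdefault(obj, [])
        if obj not in children[ref]:
            children[ref].append(obj)
        has_parent.add(obj)

    def build(node):
        # Recursively expand everything described relative to this node.
        return {child: build(child) for child in children[node]}

    return {root: build(root) for root in children if root not in has_parent}


description = [
    ("bowl", "on", "table"),
    ("apple", "in", "bowl"),
    ("book", "on", "shelf"),
]
forest = relations_to_forest(description)
```

Here `forest` has two roots, `table` and `shelf`, which are the structures a grounding step would then try to match against detected planar surfaces.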
Large Margin Learning of Upstream Scene Understanding Models
Zhu, Jun, Li, Li-jia, Fei-fei, Li, Xing, Eric P.
Upstream supervised topic models have been widely used for complicated scene understanding. However, existing maximum likelihood estimation (MLE) schemes can make the prediction model learning independent of latent topic discovery and result in an imbalanced prediction rule for scene classification. This paper presents a joint max-margin and max-likelihood learning method for upstream scene understanding models, in which latent topic discovery and prediction model estimation are closely coupled and well-balanced. The optimization problem is efficiently solved with a variational EM procedure, which iteratively solves an online loss-augmented SVM. We demonstrate the advantages of the large-margin approach on both an 8-category sports dataset and the 67-class MIT indoor scene dataset for scene categorization.
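The term "loss-augmented" can be made concrete with a generic multiclass sketch. This is not the paper's formulation, which couples the SVM with latent topic discovery inside variational EM; it only shows the standard loss-augmented prediction and the resulting hinge loss, assuming a 0/1 cost and per-class scores that are already computed. The function names and the `cost` parameter are hypothetical.

```python
def loss_augmented_label(scores, true_label, cost=1.0):
    """Return argmax over y of scores[y] + cost * [y != true_label]."""
    return max(
        range(len(scores)),
        key=lambda y: scores[y] + (cost if y != true_label else 0.0),
    )

def multiclass_hinge_loss(scores, true_label, cost=1.0):
    """Hinge loss: how far the most violating class beats the true class, margin included."""
    y_hat = loss_augmented_label(scores, true_label, cost)
    margin = cost if y_hat != true_label else 0.0
    return scores[y_hat] + margin - scores[true_label]
```

With scores `[2.0, 1.5, 0.2]` and true class 0, class 1 is the most violating label because it comes within the margin of the true class, so the loss is positive even though the plain prediction is correct.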