Estimating Generic 3D Room Structures from 2D Annotations

Neural Information Processing Systems

Indoor rooms are among the most common use cases in 3D scene understanding. Current state-of-the-art methods for this task are driven by large annotated datasets. Room layouts are especially important, consisting of structural elements in 3D, such as walls, floors, and ceilings. However, they are difficult to annotate, especially on pure RGB video. We propose a novel method to produce generic 3D room layouts just from 2D segmentation masks, which are easy to annotate for humans. Based on these 2D annotations, we automatically reconstruct 3D plane equations for the structural elements and their spatial extent in the scene, and connect adjacent elements at the appropriate contact edges. We annotate and publicly release 2246 3D room layouts on the RealEstate10k dataset, containing YouTube videos. We demonstrate the high quality of these 3D layout annotations with extensive experiments.
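The abstract mentions reconstructing 3D plane equations for structural elements such as walls and floors. A minimal sketch of the core geometric step, fitting a plane to 3D points by least squares via SVD, might look like the following (`fit_plane` is a hypothetical helper for illustration, not the paper's implementation):

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane fit: returns a unit normal n and offset d
    such that n . p + d is approximately 0 for each input point p."""
    centroid = points.mean(axis=0)
    # SVD of the centered points; the right singular vector with the
    # smallest singular value is the direction of least variance,
    # i.e. the plane normal.
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    d = -normal @ centroid
    return normal, d

# Synthetic example: noisy points near the floor plane z = 0.
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(100, 3))
pts[:, 2] = 0.01 * rng.normal(size=100)  # small height noise
n, d = fit_plane(pts)
print(np.round(np.abs(n), 2))  # normal close to [0, 0, 1]
```

In practice the points would come from triangulating the annotated 2D masks across video frames; this sketch only shows the plane-fitting step in isolation.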


LUMINOUS: Indoor Scene Generation for Embodied AI Challenges

Zhao, Yizhou, Lin, Kaixiang, Jia, Zhiwei, Gao, Qiaozi, Thattai, Govind, Thomason, Jesse, Sukhatme, Gaurav S.

arXiv.org Artificial Intelligence

Learning-based methods for training embodied agents typically require a large number of high-quality scenes that contain realistic layouts and support meaningful interactions. However, current simulators for Embodied AI (EAI) challenges only provide simulated indoor scenes with a limited number of layouts. This paper presents Luminous, the first research framework that employs state-of-the-art indoor scene synthesis algorithms to generate large-scale simulated scenes for Embodied AI challenges. Further, we automatically and quantitatively evaluate the quality of generated indoor scenes via their ability to support complex household tasks. Luminous incorporates a novel scene generation algorithm (Constrained Stochastic Scene Generation (CSSG)), which achieves competitive performance with human-designed scenes. Within Luminous, the EAI task executor, task instruction generation module, and video rendering toolkit can collectively generate a massive multimodal dataset of new scenes for the training and evaluation of Embodied AI agents. Extensive experimental results demonstrate the effectiveness of the data generated by Luminous, enabling the comprehensive assessment of embodied agents on generalization and robustness.
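The abstract describes Constrained Stochastic Scene Generation (CSSG), which samples scene layouts subject to constraints. A toy sketch of the general idea, sampling object positions and rejecting placements that violate a non-overlap constraint, might look like this (the helper names and rejection-sampling scheme are assumptions for illustration, not the paper's algorithm):

```python
import random

def overlaps(a, b):
    """Axis-aligned boxes (x1, y1, x2, y2): True if a and b intersect."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def sample_layout(room_w, room_h, objects, max_tries=1000):
    """Toy constrained stochastic placement: sample each object's
    position uniformly, rejecting samples that overlap previously
    placed objects or leave the room bounds."""
    placed = []
    for name, w, h in objects:
        for _ in range(max_tries):
            x = random.uniform(0, room_w - w)
            y = random.uniform(0, room_h - h)
            box = (x, y, x + w, y + h)
            if all(not overlaps(box, b) for _, b in placed):
                placed.append((name, box))
                break
        else:
            return None  # constraints unsatisfiable within the budget
    return placed

random.seed(0)
layout = sample_layout(10.0, 10.0, [("table", 2, 1), ("sofa", 3, 1)])
```

A real system would add richer constraints (wall alignment, reachability, functional groupings); rejection sampling is just the simplest way to make the stochastic-plus-constrained structure concrete.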


Aligned Scene Modeling of a Robot's Vista Space — An Evaluation

Swadzba, Agnes (Bielefeld University) | Wachsmuth, Sven (Bielefeld University)

AAAI Conferences

One kind of meaningful structure in indoor rooms is supporting structures such as tables and cupboards. A robot needs to know these structures for natural interaction with humans and the environment. As bottom-up detection of such structures is a challenging problem, we propose to estimate potential supporting structures from a spatial description such as "a bowl on the table". Since language and cognition schematize space in the same way, it is possible to estimate the representation of the space underlying a scene description. To do so, we introduce the aligned modeling approach, which consists of rules transforming a sequence of object relations into a set of trees, and a methodology to ground the abstract representation of the scene layout in the current perception using detectors for small movable objects and an extraction of planar surfaces. An analysis of 30 descriptions shows the robustness of our approach to a variety of description strategies and object detection errors.
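The abstract describes rules that transform a sequence of object relations into a set of trees. A minimal sketch of that idea, turning "X on Y" relations into support trees rooted at the supporters, could look like the following (`relations_to_trees` is a hypothetical helper, not the authors' implementation):

```python
def relations_to_trees(relations):
    """Turn (object, relation, support) triples into support trees:
    each root is a supporter that is not itself supported, and its
    children are the objects resting on it."""
    children = {}
    supported = set()
    for obj, rel, support in relations:
        if rel == "on":
            children.setdefault(support, []).append(obj)
            supported.add(obj)
    # Roots are supporters that never appear as a supported object.
    roots = [s for s in children if s not in supported]

    def build(node):
        return {node: [build(c) for c in children.get(node, [])]}

    return [build(r) for r in roots]

trees = relations_to_trees([("bowl", "on", "table"), ("cup", "on", "table")])
```

Grounding these trees in perception, as the paper proposes, would then match each tree node against detected objects and extracted planar surfaces; this sketch covers only the symbolic transformation step.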