Engelmann, Francis
ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding
Ji, Guangda, Weder, Silvan, Engelmann, Francis, Pollefeys, Marc, Blum, Hermann
The performance of neural networks scales with both their size and the amount of data they have been trained on. This has been demonstrated in both language modeling and image generation. However, it requires scaling-friendly network architectures as well as large-scale datasets. Even though scaling-friendly architectures like transformers have emerged for 3D vision tasks, the GPT-moment of 3D vision remains distant due to the lack of training data. In this paper, we introduce ARKit LabelMaker, the first large-scale, real-world 3D dataset with dense semantic annotations. Specifically, we complement the ARKitScenes dataset with dense semantic annotations that are automatically generated at scale. To this end, we extend LabelMaker, a recent automatic annotation pipeline, to serve the needs of large-scale pre-training. This involves extending the pipeline with cutting-edge segmentation models as well as making it robust to the challenges of large-scale processing. Further, we push forward state-of-the-art performance on the ScanNet and ScanNet200 datasets with prevalent 3D semantic segmentation models, demonstrating the efficacy of our generated dataset.
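The sketch below illustrates the pre-train-then-fine-tune recipe the abstract describes; it is not the authors' code. The tiny MLP and the random tensors are placeholders for a real sparse-convolutional 3D backbone and the actual ARKit LabelMaker / ScanNet point clouds, and the class count and feature layout are assumptions.

```python
# Minimal sketch (assumptions throughout): pre-train a 3D semantic segmentation model on
# large-scale auto-generated labels, then fine-tune on a smaller manually annotated set.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

NUM_CLASSES = 200   # e.g. the ScanNet200 label space (assumption)
FEAT_DIM = 6        # per-point xyz + rgb (assumption)

def make_fake_split(n_points: int) -> TensorDataset:
    """Stand-in for a real point-cloud dataset with per-point semantic labels."""
    feats = torch.randn(n_points, FEAT_DIM)
    labels = torch.randint(0, NUM_CLASSES, (n_points,))
    return TensorDataset(feats, labels)

# Placeholder backbone; the paper evaluates prevalent 3D semantic segmentation models.
model = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(), nn.Linear(128, NUM_CLASSES))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train(loader: DataLoader, epochs: int) -> None:
    model.train()
    for _ in range(epochs):
        for feats, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(feats), labels)
            loss.backward()
            optimizer.step()

# 1) Pre-train on the large, automatically annotated ARKit LabelMaker data ...
train(DataLoader(make_fake_split(4096), batch_size=256, shuffle=True), epochs=5)
# 2) ... then fine-tune on the smaller, manually annotated ScanNet/ScanNet200 splits.
train(DataLoader(make_fake_split(1024), batch_size=256, shuffle=True), epochs=5)
```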
Spot-Compose: A Framework for Open-Vocabulary Object Retrieval and Drawer Manipulation in Point Clouds
Lemke, Oliver, Bauer, Zuria, Zurbrügg, René, Pollefeys, Marc, Engelmann, Francis, Blum, Hermann
In recent years, modern techniques in deep learning and large-scale datasets have led to impressive progress in 3D instance segmentation, grasp pose estimation, and robotics. This allows for accurate detection directly in 3D scenes, object- and environment-aware grasp prediction, as well as robust and repeatable robotic manipulation. This work aims to integrate these recent methods into a comprehensive framework for robotic interaction and manipulation in human-centric environments. Specifically, we leverage 3D reconstructions from a commodity 3D scanner for open-vocabulary instance segmentation, alongside grasp pose estimation, to demonstrate dynamic picking of objects and opening of drawers. We show the performance and robustness of our model in two sets of real-world experiments, including dynamic object retrieval and drawer opening, reporting success rates of 51% and 82%, respectively. Code for our framework as well as videos are available at: https://spot-compose.github.io/.
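The following sketch shows one way the open-vocabulary retrieval step described above could look: a text query is matched against per-instance embeddings by cosine similarity and the best-scoring instance is handed to downstream grasping. It is an illustrative assumption, not the Spot-Compose implementation; `embed_text`, the instance list, and the embedding dimension are placeholders for a real vision-language model such as CLIP, and the grasp and execution stages are only indicated in comments.

```python
# Minimal sketch (assumptions throughout): open-vocabulary instance retrieval by
# embedding similarity over segmented 3D instances.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 512  # typical CLIP embedding size (assumption)

def unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def embed_text(query: str) -> np.ndarray:
    """Placeholder for a real text encoder; returns a unit-norm embedding."""
    return unit(rng.standard_normal(EMB_DIM))

# Each detected 3D instance carries a centroid (for grasp planning) and an embedding,
# e.g. obtained by encoding image crops of the instance from posed RGB frames.
instances = [
    {"name": f"instance_{i}",
     "centroid": rng.uniform(-1.0, 1.0, 3),
     "embedding": unit(rng.standard_normal(EMB_DIM))}
    for i in range(8)
]

def retrieve(query: str) -> dict:
    """Return the instance whose embedding best matches the text query (cosine similarity)."""
    q = embed_text(query)
    scores = [float(inst["embedding"] @ q) for inst in instances]
    return instances[int(np.argmax(scores))]

target = retrieve("the blue mug on the table")
# Downstream (not sketched): estimate a grasp pose on the target's points and
# plan a collision-free trajectory for the robot to pick it up or open the drawer.
print(target["name"], target["centroid"])
```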
OpenSUN3D: 1st Workshop Challenge on Open-Vocabulary 3D Scene Understanding
Engelmann, Francis, Takmaz, Ayca, Schult, Jonas, Fedele, Elisabetta, Wald, Johanna, Peng, Songyou, Wang, Xi, Litany, Or, Tang, Siyu, Tombari, Federico, Pollefeys, Marc, Guibas, Leonidas, Tian, Hongbo, Wang, Chunjie, Yan, Xiaosheng, Wang, Bingwen, Zhang, Xuanyang, Liu, Xiao, Nguyen, Phuc, Nguyen, Khoi, Tran, Anh, Pham, Cuong, Huang, Zhening, Wu, Xiaoyang, Chen, Xi, Zhao, Hengshuang, Zhu, Lei, Lasenby, Joan
This report provides an overview of the challenge hosted at the OpenSUN3D Workshop on Open-Vocabulary 3D Scene Understanding, held in conjunction with ICCV 2023. The goal of this workshop series is to provide a platform for exploration and discussion of open-vocabulary 3D scene understanding tasks, including but not limited to segmentation, detection, and mapping. We present the challenge dataset and evaluation methodology, and give brief descriptions of the winning methods. Additional details are available on the OpenSUN3D workshop website.
ICGNet: A Unified Approach for Instance-Centric Grasping
Zurbrügg, René, Liu, Yifan, Engelmann, Francis, Kumar, Suryansh, Hutter, Marco, Patil, Vaishakh, Yu, Fisher
Accurate grasping is key to several robotic tasks, including assembly and household robotics. Executing a successful grasp in a cluttered environment requires multiple levels of scene understanding: First, the robot needs to analyze the geometric properties of individual objects to find feasible grasps. These grasps need to be compliant with the local object geometry. Second, for each proposed grasp, the robot needs to reason about the interactions with other objects in the scene. Finally, the robot must compute a collision-free grasp trajectory while taking into account the geometry of the target object. Most grasp detection algorithms directly predict grasp poses in a monolithic fashion, which does not capture the composability of the environment. In this paper, we introduce an end-to-end architecture for object-centric grasping. The method uses point cloud data from a single arbitrary viewing direction as input and generates an instance-centric representation for each partially observed object in the scene. This representation is further used for object reconstruction and grasp detection in cluttered table-top scenes. We show the effectiveness of the proposed method by extensively evaluating it against state-of-the-art methods on synthetic datasets, indicating superior performance for grasping and reconstruction. Additionally, we demonstrate real-world applicability by decluttering scenes with varying numbers of objects.
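The sketch below illustrates the instance-centric idea the abstract describes: encode each partially observed object from a single-view point cloud into a latent code, then decode that code into both a reconstruction (here: occupancy of query points) and a set of grasp proposals. It is not ICGNet itself; the PointNet-style encoder, module sizes, and the 8-dimensional grasp parameterization (position, quaternion, score) are illustrative assumptions.

```python
# Minimal sketch (assumptions throughout): per-instance latent code shared by a
# reconstruction head and a grasp-prediction head.
import torch
import torch.nn as nn

LATENT_DIM = 256

class InstanceEncoder(nn.Module):
    """Encodes one object's partial point cloud (N, 3) into a latent code."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, LATENT_DIM))
    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        return self.mlp(pts).max(dim=0).values  # permutation-invariant pooling

class OccupancyDecoder(nn.Module):
    """Predicts occupancy of 3D query points conditioned on the instance code."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(LATENT_DIM + 3, 128), nn.ReLU(), nn.Linear(128, 1))
    def forward(self, code: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
        z = code.expand(queries.shape[0], -1)
        return torch.sigmoid(self.mlp(torch.cat([z, queries], dim=-1)))

class GraspDecoder(nn.Module):
    """Predicts a fixed set of grasp proposals (position + quaternion + score) per instance."""
    def __init__(self, num_grasps: int = 16):
        super().__init__()
        self.num_grasps = num_grasps
        self.head = nn.Linear(LATENT_DIM, num_grasps * 8)
    def forward(self, code: torch.Tensor) -> torch.Tensor:
        return self.head(code).view(self.num_grasps, 8)

encoder, occ_dec, grasp_dec = InstanceEncoder(), OccupancyDecoder(), GraspDecoder()
partial_points = torch.randn(1024, 3)           # one partially observed object (placeholder)
code = encoder(partial_points)
occupancy = occ_dec(code, torch.rand(2048, 3))  # reconstruction queries
grasps = grasp_dec(code)                        # candidate grasps for this instance
```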