Plotting

 Carlone, Luca


Task and Motion Planning in Hierarchical 3D Scene Graphs

arXiv.org Artificial Intelligence

Recent work in the construction of 3D scene graphs has enabled mobile robots to build large-scale hybrid metric-semantic hierarchical representations of the world. These detailed models contain information that is useful for planning, however how to derive a planning domain from a 3D scene graph that enables efficient computation of executable plans is an open question. In this work, we present a novel approach for defining and solving Task and Motion Planning problems in large-scale environments using hierarchical 3D scene graphs. We identify a method for building sparse problem domains which enable scaling to large scenes, and propose a technique for incrementally adding objects to that domain during planning time to avoid wasting computation on irrelevant elements of the scene graph. We test our approach in two hand crafted domains as well as two scene graphs built from perception, including one constructed from the KITTI dataset. A video supplement is available at https://youtu.be/63xuCCaN0I4.


GMKF: Generalized Moment Kalman Filter for Polynomial Systems with Arbitrary Noise

arXiv.org Artificial Intelligence

This paper develops a new filtering approach for state estimation in polynomial systems corrupted by arbitrary noise, which commonly arise in robotics. We first consider a batch setup where we perform state estimation using all data collected from the initial to the current time. We formulate the batch state estimation problem as a Polynomial Optimization Problem (POP) and relax the assumption of Gaussian noise by specifying a finite number of moments of the noise. We solve the resulting POP using a moment relaxation and prove that under suitable conditions on the rank of the relaxation, (i) we can extract a provably optimal estimate from the moment relaxation, and (ii) we can obtain a belief representation from the dual (sum-of-squares) relaxation. We then turn our attention to the filtering setup and apply similar insights to develop a GMKF for recursive state estimation in polynomial systems with arbitrary noise. The GMKF formulates the prediction and update steps as POPs and solves them using moment relaxations, carrying over a possibly non-Gaussian belief. In the linear-Gaussian case, GMKF reduces to the standard Kalman Filter. We demonstrate that GMKF performs well under highly non-Gaussian noise and outperforms common alternatives, including the Extended and Unscented Kalman Filter, and their variants on matrix Lie group.


Kimera2: Robust and Accurate Metric-Semantic SLAM in the Real World

arXiv.org Artificial Intelligence

In particular, we enhance Kimera-VIO, the visual-inertial odometry pipeline powering Kimera, to support better feature tracking, more efficient keyframe selection, and various input modalities (e.g., monocular, stereo, and RGB-D images, as well as wheel odometry). Additionally, Kimera-RPGO and Kimera-PGMO, Kimera's pose-graph optimization backends, are updated to support modern outlier rejection methods --specifically, Graduated-Non-Convexity-- for improved robustness to spurious loop closures. These new features are evaluated extensively on a variety of simulated and real robotic platforms, including drones, quadrupeds, wheeled robots, and simulated self-driving cars. We present comparisons against several state-of-the-art visual-inertial SLAM pipelines and discuss strengths and weaknesses of the new release of Kimera.


Computation-Communication Trade-offs and Sensor Selection in Real-time Estimation for Processing Networks

arXiv.org Artificial Intelligence

Recent advances in electronics are enabling substantial processing to be performed at each node (robots, sensors) of a networked system. Local processing enables data compression and may mitigate measurement noise, but it is still slower compared to a central computer (it entails a larger computational delay). However, while nodes can process the data in parallel, the centralized computational is sequential in nature. On the other hand, if a node sends raw data to a central computer for processing, it incurs communication delay. This leads to a fundamental communication-computation trade-off, where each node has to decide on the optimal amount of preprocessing in order to maximize the network performance. We consider a network in charge of estimating the state of a dynamical system and provide three contributions. First, we provide a rigorous problem formulation for optimal real-time estimation in processing networks in the presence of delays. Second, we show that, in the case of a homogeneous network (where all sensors have the same computation) that monitors a continuous-time scalar linear system, the optimal amount of local preprocessing maximizing the network estimation performance can be computed analytically. Third, we consider the realistic case of a heterogeneous network monitoring a discrete-time multi-variate linear system and provide algorithms to decide on suitable preprocessing at each node, and to select a sensor subset when computational constraints make using all sensors suboptimal. Numerical simulations show that selecting the sensors is crucial. Moreover, we show that if the nodes apply the preprocessing policy suggested by our algorithms, they can largely improve the network estimation performance.


Indoor and Outdoor 3D Scene Graph Generation via Language-Enabled Spatial Ontologies

arXiv.org Artificial Intelligence

This paper proposes an approach to build 3D scene graphs in arbitrary (indoor and outdoor) environments. Such extension is challenging; the hierarchy of concepts that describe an outdoor environment is more complex than for indoors, and manually defining such hierarchy is time-consuming and does not scale. Furthermore, the lack of training data prevents the straightforward application of learning-based tools used in indoor settings. To address these challenges, we propose two novel extensions. First, we develop methods to build a spatial ontology defining concepts and relations relevant for indoor and outdoor robot operation. In particular, we use a Large Language Model (LLM) to build such an ontology, thus largely reducing the amount of manual effort required. Second, we leverage the spatial ontology for 3D scene graph construction using Logic Tensor Networks (LTN) to add logical rules, or axioms (e.g., "a beach contains sand"), which provide additional supervisory signals at training time thus reducing the need for labelled data, providing better predictions, and even allowing predicting concepts unseen at training time. We test our approach in a variety of datasets, including indoor, rural, and coastal environments, and show that it leads to a significant increase in the quality of the 3D scene graph generation with sparsely annotated data.


Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding

arXiv.org Artificial Intelligence

Abstract semantic 3D scene understanding is a problem of critical importance in robotics. As robots still lack the common-sense knowledge about household objects and locations of an average human, we investigate the use of pre-trained language models to impart common sense for scene understanding. We introduce and compare a wide range of scene classification paradigms that leverage language only (zero-shot, embedding-based, and structured-language) or vision and language (zero-shot and fine-tuned). We find that the best approaches in both categories yield $\sim 70\%$ room classification accuracy, exceeding the performance of pure-vision and graph classifiers. We also find such methods demonstrate notable generalization and transfer capabilities stemming from their use of language.


D-Lite: Navigation-Oriented Compression of 3D Scene Graphs for Multi-Robot Collaboration

arXiv.org Artificial Intelligence

For a multi-robot team that collaboratively explores an unknown environment, it is of vital importance that collected information is efficiently shared among robots in order to support exploration and navigation tasks. Practical constraints of wireless channels, such as limited bandwidth, urge robots to carefully select information to be transmitted. In this paper, we consider the case where environmental information is modeled using a 3D Scene Graph, a hierarchical map representation that describes both geometric and semantic aspects of the environment. Then, we leverage graph-theoretic tools, namely graph spanners, to design greedy algorithms that efficiently compress 3D Scene Graphs with the aim of enabling communication between robots under bandwidth constraints. Our compression algorithms are navigation-oriented in that they are designed to approximately preserve shortest paths between locations of interest, while meeting a user-specified communication budget constraint. The effectiveness of the proposed algorithms is demonstrated in synthetic robot navigation experiments in a realistic simulator. A video abstract is available at https://youtu.be/nKYXU5VC6A8.


PyPose v0.6: The Imperative Programming Interface for Robotics

arXiv.org Artificial Intelligence

PyPose is an open-source library for robot learning. It combines a learning-based approach with physics-based optimization, which enables seamless end-to-end robot learning. It has been used in many tasks due to its meticulously designed application programming interface (API) and efficient implementation. From its initial launch in early 2022, PyPose has experienced significant enhancements, incorporating a wide variety of new features into its platform. To satisfy the growing demand for understanding and utilizing the library and reduce the learning curve of new users, we present the fundamental design principle of the imperative programming interface, and showcase the flexible usage of diverse functionalities and modules using an extremely simple Dubins car example. We also demonstrate that the PyPose can be easily used to navigate a real quadruped robot with a few lines of code.


Optimal Pose and Shape Estimation for Category-level 3D Object Perception

arXiv.org Artificial Intelligence

We consider a category-level perception problem, where one is given 3D sensor data picturing an object of a given category (e.g. a car), and has to reconstruct the pose and shape of the object despite intra-class variability (i.e. different car models have different shapes). We consider an active shape model, where -- for an object category -- we are given a library of potential CAD models describing objects in that category, and we adopt a standard formulation where pose and shape estimation are formulated as a non-convex optimization. Our first contribution is to provide the first certifiably optimal solver for pose and shape estimation. In particular, we show that rotation estimation can be decoupled from the estimation of the object translation and shape, and we demonstrate that (i) the optimal object rotation can be computed via a tight (small-size) semidefinite relaxation, and (ii) the translation and shape parameters can be computed in closed-form given the rotation. Our second contribution is to add an outlier rejection layer to our solver, hence making it robust to a large number of misdetections. Towards this goal, we wrap our optimal solver in a robust estimation scheme based on graduated non-convexity. To further enhance robustness to outliers, we also develop the first graph-theoretic formulation to prune outliers in category-level perception, which removes outliers via convex hull and maximum clique computations; the resulting approach is robust to 70%-90% outliers. Our third contribution is an extensive experimental evaluation. Besides providing an ablation study on a simulated dataset and on the PASCAL3D+ dataset, we combine our solver with a deep-learned keypoint detector, and show that the resulting approach improves over the state of the art in vehicle pose estimation in the ApolloScape datasets.


Optimal and Robust Category-level Perception: Object Pose and Shape Estimation from 2D and 3D Semantic Keypoints

arXiv.org Artificial Intelligence

We consider a category-level perception problem, where one is given 2D or 3D sensor data picturing an object of a given category (e.g., a car), and has to reconstruct the 3D pose and shape of the object despite intra-class variability (i.e., different car models have different shapes). We consider an active shape model, where -- for an object category -- we are given a library of potential CAD models describing objects in that category, and we adopt a standard formulation where pose and shape are estimated from 2D or 3D keypoints via non-convex optimization. Our first contribution is to develop PACE3D* and PACE2D*, the first certifiably optimal solvers for pose and shape estimation using 3D and 2D keypoints, respectively. Both solvers rely on the design of tight (i.e., exact) semidefinite relaxations. Our second contribution is to develop outlier-robust versions of both solvers, named PACE3D# and PACE2D#. Towards this goal, we propose ROBIN, a general graph-theoretic framework to prune outliers, which uses compatibility hypergraphs to model measurements' compatibility. We show that in category-level perception problems these hypergraphs can be built from the winding orders of the keypoints (in 2D) or their convex hulls (in 3D), and many outliers can be filtered out via maximum hyperclique computation. The last contribution is an extensive experimental evaluation. Besides providing an ablation study on simulated datasets and on the PASCAL3D+ dataset, we combine our solver with a deep keypoint detector, and show that PACE3D# improves over the state of the art in vehicle pose estimation in the ApolloScape datasets, and its runtime is compatible with practical applications. We release our code at https://github.com/MIT-SPARK/PACE.