carlone
da6ea77475918a3d83c7e49223d453cc-Supplemental.pdf
Intuitively, if the $i$-th measurement $y_i$ is an inlier (i.e., $r^2 \le c^2\beta_i^2$), then $\theta_i = +1$ and the corresponding term in (A1) reduces to least squares; if $y_i$ is an outlier (i.e., $r^2 > c^2\beta_i^2$), then $\theta_i = -1$ and the corresponding term in (A1) becomes a constant $c^2$, whence the outlier is irrelevant to the optimization. Directly developing the residual function $r^2(x, y_i) = \| b_i - s \Pi R B_i \|^2$ leads to a quartic polynomial (degree 4) in $s$ and $R$, which is not suitable for moment relaxation because it would increase the minimum relaxation order $\kappa$ [14]. $T^2$ for $t$ (the translation is bounded by a known value $T$). Towards this goal, we introduce the notion of moments, moment matrices and localizing matrices. Given a probability measure $\mu$ supported on $\mathcal{P} \subseteq \mathbb{R}^n$, its moment of order $\alpha \in \mathbb{Z}^n_+$ is the scalar $z_\alpha$.
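The inlier/outlier intuition at the start of this excerpt can be illustrated with a minimal Python sketch. Equation (A1) is not reproduced here, so the sketch assumes the common truncated-least-squares form in which each term is (1 + θ)/2 · r²/β² + (1 − θ)/2 · c². That form and the function names `tls_term`, `best_theta`, and `truncated_cost` are assumptions made for illustration, not taken from the source.

```python
def tls_term(r2, beta, c, theta):
    """One term of the ASSUMED binary form: theta=+1 keeps the scaled
    squared residual, theta=-1 replaces it with the constant c^2."""
    if theta not in (1, -1):
        raise ValueError("theta must be +1 or -1")
    return 0.5 * (1 + theta) * r2 / beta**2 + 0.5 * (1 - theta) * c**2


def best_theta(r2, beta, c):
    """Inlier test from the text: +1 if r^2 <= c^2 * beta^2, else -1."""
    return 1 if r2 <= c**2 * beta**2 else -1


def truncated_cost(r2, beta, c):
    """Equivalent closed form: the scaled residual, capped at c^2."""
    return min(r2 / beta**2, c**2)


if __name__ == "__main__":
    for r2 in (0.5, 4.0):  # one inlier-sized and one outlier-sized residual
        theta = best_theta(r2, beta=1.0, c=1.0)
        print(r2, theta, tls_term(r2, 1.0, 1.0, theta))
```

Under this assumed form, choosing θ by the inlier test always reproduces the capped cost, which is why an outlier contributes only a constant and cannot influence the estimate.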
Teaching robots to map large environments
A robot searching for workers trapped in a partially collapsed mine shaft must rapidly generate a map of the scene and identify its location within that scene as it navigates the treacherous terrain. Researchers have recently started building powerful machine-learning models to perform this complex task using only images from the robot's onboard cameras, but even the best models can only process a few images at a time. In a real-world disaster where every second counts, a search-and-rescue robot would need to quickly traverse large areas and process thousands of images to complete its mission. To overcome this problem, MIT researchers drew on ideas from both recent artificial intelligence vision models and classical computer vision to develop a new system that can process an arbitrary number of images. Their system accurately generates 3D maps of complicated scenes like a crowded office corridor in a matter of seconds.
Certifiably-Correct Mapping for Safe Navigation Despite Odometry Drift
Agrawal, Devansh R., Kim, Taekyung, Govindjee, Rajiv, Adeshara, Trushant, Yu, Jiangbo, Ravikumar, Anurekha, Panagou, Dimitra
Accurate perception, state estimation and mapping are essential for safe robotic navigation as planners and controllers rely on these components for safety-critical decisions. However, existing mapping approaches often assume perfect pose estimates, an unrealistic assumption that can lead to incorrect obstacle maps and therefore collisions. This paper introduces a framework for certifiably-correct mapping that ensures that the obstacle map correctly classifies obstacle-free regions despite the odometry drift in vision-based localization systems (VIO/SLAM). By deflating the safe region based on the incremental odometry error at each timestep, we ensure that the map remains accurate and reliable locally around the robot, even as the overall odometry error with respect to the inertial frame grows unbounded. Our contributions include two approaches to modify popular obstacle mapping paradigms: (I) Safe Flight Corridors, and (II) Signed Distance Fields. We formally prove the correctness of both methods, and describe how they integrate with existing planning and control modules. Simulations using the Replica dataset highlight the efficacy of our methods compared to state-of-the-art techniques. Real-world experiments with a robotic rover show that, while baseline methods result in collisions with previously mapped obstacles, the proposed framework enables the rover to safely stop before potential collisions.
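The idea of deflating the safe region by the incremental odometry error can be sketched, in heavily simplified form, for a signed distance field. The sketch below is not the paper's algorithm. It assumes a translation-only error with a known per-timestep bound (in metres), and the names `deflate_sdf` and `certified_free` are hypothetical.

```python
import numpy as np


def deflate_sdf(sdf, eps_steps):
    """Subtract each per-timestep error bound from every stored distance.

    ASSUMPTION: if the map may have shifted by at most eps relative to the
    robot, the true clearance at a cell is at least (stored value - eps).
    """
    sdf = np.asarray(sdf, dtype=float)
    for eps in eps_steps:
        if eps < 0:
            raise ValueError("error bounds must be non-negative")
        sdf = sdf - eps
    return sdf


def certified_free(sdf, robot_radius):
    """A cell stays certified only while its deflated clearance exceeds the robot radius."""
    return np.asarray(sdf) > robot_radius


if __name__ == "__main__":
    clearances = [2.0, 0.5, 1.0]   # stored distances to the nearest obstacle
    bounds = [0.1, 0.1, 0.2]       # hypothetical incremental error bounds
    deflated = deflate_sdf(clearances, bounds)
    print(deflated, certified_free(deflated, robot_radius=0.3))
```

In this toy version, cells mapped long ago lose their certification as the bounds accumulate, while recently observed cells near the robot keep it. This mirrors the claim that the map stays reliable locally even as global drift grows.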
Safe and Efficient Estimation for Robotics through the Optimal Use of Resources
To operate in and interact with the physical world, robots need to have estimates of the current and future state of the environment. We thus equip robots with sensors and build models and algorithms that, given some measurements, produce estimates of the current or future states. Environments can be unpredictable and sensors are not perfect. Therefore, it is important to both use all information available, and to do so optimally: making sure that we get the best possible answer from the amount of information we have. However, in prevalent research, uncommon sensors, such as sound or radio-frequency signals, are commonly ignored for state estimation; and the most popular solvers employed to produce state estimates are only of local nature, meaning they may produce suboptimal estimates for the typically non-convex estimation problems. My research aims to use resources more optimally through 1) multi-modality: using ubiquitous RF transceivers and microphones to support state estimation, 2) certifiably optimal solvers, and 3) learning and improving adequate models from data.
Kimera2: Robust and Accurate Metric-Semantic SLAM in the Real World
Abate, Marcus, Chang, Yun, Hughes, Nathan, Carlone, Luca
In particular, we enhance Kimera-VIO, the visual-inertial odometry pipeline powering Kimera, to support better feature tracking, more efficient keyframe selection, and various input modalities (e.g., monocular, stereo, and RGB-D images, as well as wheel odometry). Additionally, Kimera-RPGO and Kimera-PGMO, Kimera's pose-graph optimization backends, are updated to support modern outlier rejection methods (specifically, Graduated Non-Convexity) for improved robustness to spurious loop closures. These new features are evaluated extensively on a variety of simulated and real robotic platforms, including drones, quadrupeds, wheeled robots, and simulated self-driving cars. We present comparisons against several state-of-the-art visual-inertial SLAM pipelines and discuss strengths and weaknesses of the new release of Kimera.
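To give a feel for how Graduated Non-Convexity rejects outliers, here is a minimal Python sketch on a toy problem: estimating a scalar from measurements that include one gross outlier. It is not the Kimera-RPGO or Kimera-PGMO implementation. The Geman-McClure surrogate, the continuation factor 1.4, the initialization of `mu`, and the function name `gnc_gm_mean` are illustrative assumptions following the commonly described form of the scheme.

```python
import numpy as np


def gnc_gm_mean(y, c=1.0, mu_factor=1.4, max_iters=100):
    """Robust scalar mean via GNC with a Geman-McClure surrogate (toy sketch).

    Alternates a weighted least-squares solve with a closed-form weight
    update, while mu is decreased toward 1 so that the surrogate cost goes
    from nearly convex to the fully robust cost.
    """
    y = np.asarray(y, dtype=float)
    x = y.mean()                                  # plain least-squares start
    r2 = (y - x) ** 2
    mu = max(2.0 * r2.max() / c**2, 1.0)          # large mu: nearly convex
    w = np.ones_like(y)
    for _ in range(max_iters):
        w = (mu * c**2 / (r2 + mu * c**2)) ** 2   # weight update
        x = np.average(y, weights=w)              # weighted least squares
        r2 = (y - x) ** 2
        if mu <= 1.0:                             # original robust cost reached
            break
        mu = max(mu / mu_factor, 1.0)             # increase non-convexity
    return x, w


if __name__ == "__main__":
    data = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0]      # last value is an outlier
    estimate, weights = gnc_gm_mean(data, c=0.5)
    print(estimate, weights.round(3))
```

In a pose-graph backend the weighted least-squares step would be a full nonlinear optimization over poses and the residuals would come from loop-closure factors, but the alternation between solving and reweighting is the same idea.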
VERF: Runtime Monitoring of Pose Estimation with Neural Radiance Fields
Maggio, Dominic, Mario, Courtney, Carlone, Luca
We present VERF, a collection of two methods (VERF-PnP and VERF-Light) for providing runtime assurance on the correctness of a camera pose estimate of a monocular camera without relying on direct depth measurements. We leverage the ability of NeRF (Neural Radiance Fields) to render novel RGB perspectives of a scene. We only require as input the camera image whose pose is being estimated, an estimate of the camera pose we want to monitor, and a NeRF model containing the scene pictured by the camera. We can then predict if the pose estimate is within a desired distance from the ground truth and justify our prediction with a level of confidence. VERF-Light does this by rendering a viewpoint with NeRF at the estimated pose and estimating its relative offset to the sensor image up to scale. Since scene scale is unknown, the approach renders another auxiliary image and reasons over the consistency of the optical flows across the three images. VERF-PnP takes a different approach by rendering a stereo pair of images with NeRF and utilizing the Perspective-n-Point (PnP) algorithm. We evaluate both methods on the LLFF dataset, on data from a Unitree A1 quadruped robot, and on data collected from Blue Origin's sub-orbital New Shepard rocket to demonstrate the effectiveness of the proposed pose monitoring method across a range of scene scales. We also show monitoring can be completed in under half a second on a 3090 GPU.
A Probabilistic Relaxation of the Two-Stage Object Pose Estimation Paradigm
Existing object pose estimation methods commonly require a one-to-one point matching step that forces them to be separated into two consecutive stages: visual correspondence detection (e.g., by matching feature descriptors as part of a perception front-end) followed by geometric alignment (e.g., by optimizing a robust estimation objective for pointcloud registration or perspective-n-point). Instead, we propose a matching-free probabilistic formulation with two main benefits: i) it enables unified and concurrent optimization of both visual correspondence and geometric alignment, and ii) it can represent different plausible modes of the entire distribution of likely poses. This in turn allows for a more graceful treatment of geometric perception scenarios where establishing one-to-one matches between points is conceptually ill-defined, such as textureless, symmetrical and/or occluded objects and scenes where the correct pose is uncertain or there are multiple equally valid solutions.
Hydra-Multi: Collaborative Online Construction of 3D Scene Graphs with Multi-Robot Teams
Chang, Yun, Hughes, Nathan, Ray, Aaron, Carlone, Luca
3D scene graphs have recently emerged as an expressive high-level map representation that describes a 3D environment as a layered graph where nodes represent spatial concepts at multiple levels of abstraction (e.g., objects, rooms, buildings) and edges represent relations between concepts (e.g., inclusion, adjacency). This paper describes Hydra-Multi, the first multi-robot spatial perception system capable of constructing a multi-robot 3D scene graph online from sensor data collected by robots in a team. In particular, we develop a centralized system capable of constructing a joint 3D scene graph by taking incremental inputs from multiple robots, effectively finding the relative transforms between the robots' frames, and incorporating loop closure detections to correctly reconcile the scene graph nodes from different robots. We evaluate Hydra-Multi on simulated and real scenarios and show it is able to reconstruct accurate 3D scene graphs online. We also demonstrate Hydra-Multi's capability of supporting heterogeneous teams by fusing different map representations built by robots with different sensor suites.
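The layered-graph description above can be made concrete with a small Python sketch of the data structure. This is an illustrative toy, not Hydra-Multi's representation. The class name `SceneGraph`, its methods, and the string labels for layers and relations are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy layered graph: nodes carry a layer label, edges carry a relation."""
    nodes: dict = field(default_factory=dict)   # node_id -> layer name
    edges: set = field(default_factory=set)     # (src, dst, relation)

    def add_node(self, node_id, layer):
        self.nodes[node_id] = layer

    def add_edge(self, src, dst, relation):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.add((src, dst, relation))

    def children(self, node_id):
        """Nodes that node_id contains (inclusion edges), sorted by id."""
        return sorted(d for s, d, r in self.edges
                      if s == node_id and r == "inclusion")


if __name__ == "__main__":
    g = SceneGraph()
    g.add_node("building", "buildings")
    g.add_node("kitchen", "rooms")
    g.add_node("office", "rooms")
    g.add_node("mug", "objects")
    g.add_edge("building", "kitchen", "inclusion")
    g.add_edge("building", "office", "inclusion")
    g.add_edge("kitchen", "mug", "inclusion")
    g.add_edge("kitchen", "office", "adjacency")
    print(g.children("building"))
```

A multi-robot system would additionally need to express each robot's nodes in a common frame and merge nodes that loop closures identify as the same place. The sketch leaves both steps out.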
Trusting Robots to Navigate New Spaces
When Vasileios Tzoumas, a research scientist at the Massachusetts Institute of Technology (MIT), visits a new city, he likes to explore by going for a run. And sometimes he gets lost. A few years ago, on a long run while in Osaka for a conference, the inevitable happened. But then Tzoumas spotted a 7-Eleven he remembered passing soon after leaving his hotel. This recognition allowed him to mentally "close the loop," to connect the loose end of his trajectory to someplace he knew and was sure about, thus solidifying his mental map and allowing him to make his way back to the hotel.
Giving Robots Human-Like Perception of Their Physical Environments
Kimera builds a dense 3D semantic mesh of an environment and can track humans in the environment. The figure shows a multi-frame action sequence of a human moving in the scene. "Alexa, go to the kitchen and fetch me a snack." Wouldn't we all appreciate a little help around the house, especially if that help came in the form of a smart, adaptable, uncomplaining robot? Sure, there are the one-trick Roombas of the appliance world. But MIT engineers are envisioning robots more like home helpers, able to follow high-level, Alexa-type commands, such as "Go to the kitchen and fetch me a coffee cup."