kimera
Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction
Yu, Xuan, Xie, Yuxuan, Liu, Yili, Lu, Haojian, Xiong, Rong, Liao, Yiyi, Wang, Yue
Open-vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map. Unlike existing methods that separate the optimization of queries and keys or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while maintaining end-to-end optimizability. Moreover, this query formulation facilitates the alignment of 2D open-vocabulary instance IDs across frames by leveraging optimal linear assignment with instance masks rendered from the queries. Additionally, we ensure semantic-instance segmentation consistency by fusing query-based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to match the number of objects. PanopticRecon++ achieves competitive 3D and 2D segmentation and reconstruction performance on both simulation and real-world datasets, and demonstrates a use case as a robot simulator. Our project website is at: https://yuxuan1206.github.io/panopticrecon_pp/
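The cross-frame ID alignment mentioned in the abstract relies on optimal linear assignment between masks rendered from the instance queries and the 2D masks observed in a frame. The sketch below only illustrates that matching idea and is not the authors' implementation: the names `iou` and `match_masks`, the use of IoU as the matching score, and the toy masks are all assumptions. It uses exhaustive search, which is practical only for a handful of masks; a real system would more likely use a Hungarian-style solver.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_masks(rendered, observed):
    """Assign each observed 2D mask to a distinct rendered query mask.

    Maximizes the total IoU by brute force over all one-to-one assignments.
    Assumes len(observed) <= len(rendered) (hypothetical simplification).
    Returns a dict: observed mask index -> query index.
    """
    best_score, best_perm = -1.0, ()
    for perm in permutations(range(len(rendered)), len(observed)):
        score = sum(iou(rendered[q], observed[i]) for i, q in enumerate(perm))
        if score > best_score:
            best_score, best_perm = score, perm
    return dict(enumerate(best_perm))


# Toy example: three query masks, two masks detected in the current frame.
rendered = [{0, 1, 2, 3}, {10, 11, 12}, {20, 21}]
observed = [{10, 11}, {0, 1, 2}]
assignment = match_masks(rendered, observed)
print(assignment)
```

In this toy case the first observed mask overlaps only the second query and the second observed mask overlaps only the first, so the per-frame IDs are remapped onto those two globally consistent query IDs.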
FM-Fusion: Instance-aware Semantic Mapping Boosted by Vision-Language Foundation Models
Liu, Chuhao, Wang, Ke, Shi, Jieqi, Qiao, Zhijian, Shen, Shaojie
Semantic mapping based on supervised object detectors is sensitive to image distribution. In real-world environments, object detection and segmentation performance can drop sharply, preventing the use of semantic mapping in a wider domain. On the other hand, vision-language foundation models demonstrate strong zero-shot transferability across data distributions, which provides an opportunity to construct generalizable instance-aware semantic maps. Hence, this work explores how to boost instance-aware semantic mapping using object detections generated by foundation models. We propose a probabilistic label fusion method to predict closed-set semantic classes from open-set label measurements. An instance refinement module merges the over-segmented instances caused by inconsistent segmentation. We integrate all the modules into a unified semantic mapping system. Reading a sequence of RGB-D input, our work incrementally reconstructs an instance-aware semantic map. We evaluate the zero-shot performance of our method on the ScanNet and SceneNN datasets. Our method achieves 40.3 mean average precision (mAP) on the ScanNet semantic instance segmentation task. It significantly outperforms the traditional semantic mapping method.
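One common way to fuse open-set label measurements into closed-set class probabilities is a recursive Bayesian update. The sketch below illustrates that general idea under stated assumptions and is not claimed to be FM-Fusion's actual model: the function `fuse_labels`, the likelihood table, the class names, and the fallback probability for unseen labels are all hypothetical.

```python
def fuse_labels(prior, likelihood, measurements, unseen=1e-3):
    """Recursive Bayesian update of closed-set class probabilities.

    prior:        dict class -> prior probability
    likelihood:   dict class -> dict open_label -> p(open_label | class)
    measurements: sequence of open-set labels observed for one instance
    unseen:       assumed likelihood for labels missing from the table
    """
    belief = dict(prior)
    for label in measurements:
        # Multiply the current belief by the measurement likelihood.
        unnorm = {c: belief[c] * likelihood[c].get(label, unseen) for c in belief}
        total = sum(unnorm.values())
        # Renormalize so the belief remains a probability distribution.
        belief = {c: p / total for c, p in unnorm.items()}
    return belief


# Made-up likelihoods p(open-set label | closed-set class).
likelihood = {
    "chair": {"chair": 0.70, "armchair": 0.25, "couch": 0.05},
    "sofa": {"chair": 0.10, "armchair": 0.30, "couch": 0.60},
}
prior = {"chair": 0.5, "sofa": 0.5}
belief = fuse_labels(prior, likelihood, ["armchair", "couch", "couch"])
print(max(belief, key=belief.get), belief)
```

With these made-up numbers, an ambiguous "armchair" detection followed by two "couch" detections drives the belief toward the closed-set class "sofa".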
Kimera2: Robust and Accurate Metric-Semantic SLAM in the Real World
Abate, Marcus, Chang, Yun, Hughes, Nathan, Carlone, Luca
In particular, we enhance Kimera-VIO, the visual-inertial odometry pipeline powering Kimera, to support better feature tracking, more efficient keyframe selection, and various input modalities (e.g., monocular, stereo, and RGB-D images, as well as wheel odometry). Additionally, Kimera-RPGO and Kimera-PGMO, Kimera's pose-graph optimization backends, are updated to support modern outlier rejection methods (specifically, Graduated Non-Convexity) for improved robustness to spurious loop closures. These new features are evaluated extensively on a variety of simulated and real robotic platforms, including drones, quadrupeds, wheeled robots, and simulated self-driving cars. We present comparisons against several state-of-the-art visual-inertial SLAM pipelines and discuss strengths and weaknesses of the new release of Kimera.
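To give a feel for how Graduated Non-Convexity (GNC) rejects outliers, the sketch below applies it to a toy problem: estimating a single scalar from measurements that include gross outliers, standing in for spurious loop closures. It is not Kimera's code. It assumes the Geman-McClure variant of GNC with the commonly published weight update and annealing schedule; the function name `gnc_gm_mean` and the parameters `c` and `mu_factor` are illustrative choices.

```python
def gnc_gm_mean(values, c=1.0, mu_factor=1.4, max_iters=100):
    """Robust scalar estimate via GNC with a Geman-McClure surrogate cost.

    c:         assumed inlier noise bound (residuals well above c are outliers)
    mu_factor: how quickly the surrogate is annealed toward the robust cost
    Returns (estimate, per-measurement weights).
    """
    # Start from the ordinary least-squares solution (the plain mean).
    estimate = sum(values) / len(values)
    r_max_sq = max((v - estimate) ** 2 for v in values)
    # Large mu makes the surrogate nearly quadratic (convex); mu = 1 recovers
    # the original Geman-McClure cost.
    mu = max(1.0, 2.0 * r_max_sq / c ** 2)
    weights = [1.0] * len(values)
    for _ in range(max_iters):
        # Weight update: measurements with large residuals are down-weighted.
        weights = [
            (mu * c ** 2 / ((v - estimate) ** 2 + mu * c ** 2)) ** 2
            for v in values
        ]
        # Variable update: weighted least squares with the weights held fixed.
        estimate = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if mu <= 1.0:
            break
        mu = max(1.0, mu / mu_factor)  # gradually restore non-convexity
    return estimate, weights


# Five measurements near 1.0 plus two gross outliers.
values = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0, 12.0]
estimate, weights = gnc_gm_mean(values)
print(round(estimate, 3), [round(w, 4) for w in weights])
```

In a pose-graph backend the variable update would be a full nonlinear least-squares solve rather than a weighted mean, but the alternation between weight updates and re-optimization is the same.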
ViWiD: Leveraging WiFi for Robust and Resource-Efficient SLAM
Arun, Aditya, Hunter, William, Ayyalasomayajula, Roshan, Bharadia, Dinesh
Recent interest in autonomous navigation and exploration robots for indoor applications has spurred research into indoor Simultaneous Localization and Mapping (SLAM) robot systems. While most of these SLAM systems use visual and LiDAR sensors in tandem with an odometry sensor, these odometry sensors drift over time. To combat this drift, visual SLAM systems deploy compute- and memory-intensive search algorithms to detect "loop closures", which make the trajectory estimate globally consistent. To circumvent these resource-intensive algorithms, we present ViWiD, which integrates WiFi and visual sensors in a dual-layered system. This dual-layered approach separates the tasks of local and global trajectory estimation, making ViWiD resource-efficient while achieving performance on par with or better than state-of-the-art visual SLAM. We demonstrate ViWiD's performance on four datasets, covering over 1500 m of traversed path, and show 4.3x and 4x reductions in compute and memory consumption, respectively, compared to state-of-the-art visual and LiDAR SLAM systems, with on-par SLAM performance.
Giving Robots Human-Like Perception of Their Physical Environments
Kimera builds a dense 3D semantic mesh of an environment and can track humans in the environment. The figure shows a multi-frame action sequence of a human moving in the scene. "Alexa, go to the kitchen and fetch me a snack." Wouldn't we all appreciate a little help around the house, especially if that help came in the form of a smart, adaptable, uncomplaining robot? Sure, there are the one-trick Roombas of the appliance world. But MIT engineers are envisioning robots more like home helpers, able to follow high-level, Alexa-type commands, such as "Go to the kitchen and fetch me a coffee cup."
To carry out such high-level tasks, researchers believe robots will have to be able to perceive their physical environment as humans do. "In order to make any decision in the world, you need to have a mental model of the environment around you," says Luca Carlone, assistant professor of aeronautics and astronautics at MIT. "This is something so effortless for humans. But for robots it's a painfully hard problem, where it's about transforming pixel values that they see through a camera, into an understanding of the world." Now Carlone and his students have developed a representation of spatial perception for robots that is modeled after the way humans perceive and navigate the world. The new model, which they call 3D Dynamic Scene Graphs, enables a robot to quickly generate a 3D map of its surroundings that also includes objects and their semantic labels (a chair versus a table, for instance), as well as people, rooms, walls, and other structures that the robot is likely seeing in its environment.
Moving Humanity Forward
Kimera has created the world's first Artificial General Intelligence (AGI). Unlike traditional AI, which is limited to one field or set of tasks, AGI can do virtually anything across any industry. Kimera's vision is to use this technology to move humanity forward. TOKEN SALE IS NOW LIVE: https://kimera.ai/
Kimera Systems ICO – Moving Humanity Forward
AGI is a highly advanced and complex technology. It required years of research and creative thinking to develop Nigel AGI. Kimera wants this technology to be owned by a multitude of individuals and wants many of you to understand it. Kimera's AGI is based on the General Theory of Intelligence, which defines intelligence from a quantum physics perspective rather than a neuroscience one. To enable continued general learning, the single algorithm focuses on learning cause and effect by observing reality through users' devices.