Huang, Guoquan
Online Language Splatting
Katragadda, Saimouli, Wu, Cho-Ying, Guo, Yuliang, Huang, Xinyu, Huang, Guoquan, Ren, Liu
To enable AI agents to interact seamlessly with both humans and 3D environments, they must not only perceive the 3D world accurately but also align human language with 3D spatial representations. While prior work has made significant progress by integrating language features into geometrically detailed 3D scene representations using 3D Gaussian Splatting (GS), these approaches rely on computationally intensive offline preprocessing of language features for each input image, limiting adaptability to new environments. In this work, we introduce Online Language Splatting, the first framework to achieve online, near real-time, open-vocabulary language mapping within a 3DGS-SLAM system without requiring pre-generated language features. The key challenge lies in efficiently fusing high-dimensional language features into 3D representations while balancing computation speed, memory usage, rendering quality, and open-vocabulary capability. To this end, we design: (1) a high-resolution CLIP embedding module capable of generating detailed language feature maps in 18 ms per frame, (2) a two-stage online auto-encoder that compresses 768-dimensional CLIP features to 15 dimensions while preserving open-vocabulary capabilities, and (3) a color-language disentangled optimization approach to improve rendering quality. Experimental results show that our online method not only surpasses state-of-the-art offline methods in accuracy but also achieves a more than 40x efficiency boost, demonstrating its potential for dynamic and interactive AI applications.
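A minimal sketch of the kind of two-stage compression the abstract describes, assuming simple MLP encoders/decoders. The 768 and 15 dimensions come from the abstract; the hidden sizes and the 64-dimensional intermediate stage are illustrative assumptions, not the paper's architecture.

```python
# Illustrative sketch of a two-stage autoencoder compressing per-pixel 768-D
# CLIP features to a 15-D latent (dimensions from the abstract; layer sizes
# and the 64-D intermediate stage are assumptions).
import torch
import torch.nn as nn

class TwoStageCLIPAutoEncoder(nn.Module):
    def __init__(self, in_dim=768, mid_dim=64, latent_dim=15):
        super().__init__()
        # Stage 1: coarse compression of the CLIP feature.
        self.enc1 = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, mid_dim))
        self.dec1 = nn.Sequential(nn.Linear(mid_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))
        # Stage 2: further compression to the low-dimensional code stored per Gaussian.
        self.enc2 = nn.Sequential(nn.Linear(mid_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim))
        self.dec2 = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, mid_dim))

    def forward(self, clip_feat):           # clip_feat: (N, 768)
        z1 = self.enc1(clip_feat)
        z2 = self.enc2(z1)
        rec1 = self.dec1(z1)                 # stage-1 reconstruction
        rec2 = self.dec1(self.dec2(z2))      # full reconstruction through both stages
        return z2, rec1, rec2

# Usage: train with reconstruction losses on both stages, then attach the
# 15-D codes to the 3D Gaussians for language rendering.
model = TwoStageCLIPAutoEncoder()
codes, rec1, rec2 = model(torch.randn(4, 768))
```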
Robust 4D Radar-aided Inertial Navigation for Aerial Vehicles
Zhu, Jinwen, Hu, Jun, Zhao, Xudong, Lang, Xiaoming, Mao, Yinian, Huang, Guoquan
While LiDAR and cameras are becoming ubiquitous on unmanned aerial vehicles (UAVs), they can be ineffective in challenging environments, whereas 4D millimeter-wave (MMW) radars, which provide robust 3D ranging and Doppler velocity measurements, remain less exploited for aerial navigation. In this paper, we develop an efficient and robust error-state Kalman filter (ESKF)-based radar-inertial navigation system for UAVs. The key idea of the proposed approach is point-to-distribution radar scan matching, which provides motion constraints with proper uncertainty quantification; these constraints are used to update the navigation states in a tightly coupled manner, along with the Doppler velocity measurements. Moreover, we propose a robust keyframe-based matching scheme against the prior map (if available) to bound the accumulated navigation errors and thus provide a radar-based global localization solution with high accuracy. Extensive real-world experimental validations demonstrate that the proposed radar-aided inertial navigation outperforms state-of-the-art methods in both accuracy and robustness.
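A small sketch of the point-to-distribution idea mentioned above: each new radar point is matched against a local Gaussian (mean, covariance) of the previous scan or map, and the Mahalanobis-weighted error would drive the ESKF update. The voxel bookkeeping, Doppler update, and the filter itself are omitted; all names and the noise value are illustrative.

```python
# Point-to-distribution residual sketch (not the paper's implementation).
import numpy as np

def point_to_distribution_residual(p_body, R_wb, t_wb, mu, Sigma, noise=0.05**2):
    """Residual and information matrix for one radar point.

    p_body      : (3,) point in the radar/body frame
    R_wb, t_wb  : current pose estimate (body -> world)
    mu, Sigma   : mean and covariance of the matched point distribution in world
    """
    p_world = R_wb @ p_body + t_wb
    r = p_world - mu                                   # 3-D residual
    info = np.linalg.inv(Sigma + noise * np.eye(3))    # uncertainty-aware weight
    return r, info

# Example with a synthetic distribution:
r, W = point_to_distribution_residual(
    np.array([1.0, 0.2, 0.0]), np.eye(3), np.zeros(3),
    mu=np.array([1.02, 0.18, 0.01]), Sigma=0.01 * np.eye(3))
mahalanobis_sq = float(r @ W @ r)   # can also serve as an outlier gate
```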
Visual-Inertial SLAM as Simple as A, B, VINS
Merrill, Nathaniel, Huang, Guoquan
We present AB-VINS, a different kind of visual-inertial SLAM system. Unlike most VINS systems, which rely solely on hand-crafted techniques, AB-VINS makes use of three different deep networks. Instead of estimating sparse feature positions, AB-VINS only estimates the scale and bias parameters (a and b) of monocular depth maps, as well as other terms to correct the depth using multi-view information, which results in a compressed feature state. Despite being an optimization-based system, the main VIO thread of AB-VINS surpasses the efficiency of a state-of-the-art filter-based method while also providing dense depth. While state-of-the-art loop-closing SLAM systems have to relinearize a number of variables linear in the number of keyframes, AB-VINS can perform loop closures while only affecting a constant number of variables. This is due to a novel data structure called the memory tree, in which the keyframe poses are defined relative to each other rather than all in one global frame, allowing all but a few states to be fixed. AB-VINS is not as accurate as state-of-the-art VINS systems, but it is shown through careful experimentation to be more robust.
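To make the (a, b) parameterization concrete, here is a minimal sketch of an affine depth correction: a per-keyframe scale and bias turn a monocular depth map into metric depth. The least-squares fit against a few sparse metric anchors is illustrative only; AB-VINS estimates these terms (plus further multi-view corrections) inside its optimization.

```python
# Affine depth correction sketch: depth_metric = a * depth_mono + b.
import numpy as np

def fit_scale_bias(mono_depth, metric_depth):
    """Solve a, b minimizing || a * mono_depth + b - metric_depth ||^2."""
    A = np.stack([mono_depth, np.ones_like(mono_depth)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, metric_depth, rcond=None)
    return a, b

mono = np.array([0.9, 1.8, 3.1, 4.2])     # network depth (up to scale/bias)
metric = np.array([2.0, 4.1, 6.9, 9.3])   # sparse metric anchors (illustrative)
a, b = fit_scale_bias(mono, metric)
corrected = a * mono + b                   # dense metric depth estimate
```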
Square-Root Inverse Filter-based GNSS-Visual-Inertial Navigation
Hu, Jun, Lang, Xiaoming, Zhang, Feng, Mao, Yinian, Huang, Guoquan
While the Global Navigation Satellite System (GNSS) is often used to provide global positioning when available, its intermittency and/or inaccuracy calls for fusion with other sensors. In this paper, we develop a novel GNSS-Visual-Inertial Navigation System (GVINS) that fuses visual, inertial, and raw GNSS measurements within the square-root inverse sliding window filtering (SRI-SWF) framework in a tightly coupled fashion, and is thus termed SRI-GVINS. In particular, for the first time, we deeply fuse the GNSS pseudorange, Doppler shift, single-differenced pseudorange, and double-differenced carrier phase measurements, along with the visual-inertial measurements. Inherited from the SRI-SWF, the proposed SRI-GVINS gains significant numerical stability and computational efficiency over the state-of-the-art methods. Additionally, we propose to use a filter to sequentially initialize the reference frame transformation until it converges, rather than collecting measurements for batch optimization. We also perform online calibration of GNSS-IMU extrinsic parameters to mitigate possible extrinsic parameter degradation. The proposed SRI-GVINS is extensively evaluated on our own collected UAV datasets, and the results demonstrate that the proposed method is able to suppress VIO drift in real time and also show the effectiveness of online GNSS-IMU extrinsic calibration. The experimental validation on public datasets further reveals that the proposed SRI-GVINS outperforms the state-of-the-art methods in terms of both accuracy and efficiency.
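A minimal sketch of the square-root inverse (information) update such a filter builds on: the state is maintained through an upper-triangular square-root information factor, and a whitened, linearized measurement is absorbed by a QR factorization. The GNSS measurement models themselves (pseudorange, Doppler, differenced carrier phase) are not reproduced here; the toy dimensions and Jacobian row are assumptions.

```python
# Square-root inverse filter measurement update via QR (illustrative sketch).
import numpy as np

def sri_update(R_sqrt, r_vec, H, z, sigma):
    Hw, zw = H / sigma, z / sigma                 # whiten the measurement
    A = np.vstack([R_sqrt, Hw])                   # stack prior factor and measurement
    b = np.concatenate([r_vec, zw])
    Q, R_new = np.linalg.qr(A)                    # thin QR re-triangularizes the factor
    return R_new, Q.T @ b

n = 3
R_sqrt, r_vec = np.eye(n), np.zeros(n)            # toy prior factor
H = np.array([[1.0, 0.0, -1.0]])                  # e.g., one differenced-measurement row
R_sqrt, r_vec = sri_update(R_sqrt, r_vec, H, np.array([0.2]), sigma=0.5)
dx = np.linalg.solve(R_sqrt, r_vec)               # state correction via back-substitution
```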
General Place Recognition Survey: Towards Real-World Autonomy
Yin, Peng, Jiao, Jianhao, Zhao, Shiqi, Xu, Lingyun, Huang, Guoquan, Choset, Howie, Scherer, Sebastian, Han, Jianda
In the realm of robotics, the quest to achieve real-world autonomy, capable of executing large-scale and long-term operations, has positioned place recognition (PR) as a cornerstone technology. Despite the PR community's remarkable strides over the past two decades, garnering attention from fields like computer vision and robotics, the development of PR methods that sufficiently support real-world robotic systems remains a challenge. This paper aims to bridge this gap by highlighting the crucial role of PR within the framework of Simultaneous Localization and Mapping (SLAM) 2.0. This new phase in robotic navigation calls for scalable, adaptable, and efficient PR solutions that integrate advanced artificial intelligence (AI) technologies. Toward this goal, we provide a comprehensive review of the current state-of-the-art (SOTA) advancements in PR, alongside the remaining challenges, and underscore its broad applications in robotics. The paper begins with an exploration of PR's formulation and key research challenges. We extensively review the literature, focusing on methods for place representation and solutions to various PR challenges. Applications showcasing PR's potential in robotics, key PR datasets, and open-source libraries are discussed. We also highlight our open-source package, aimed at new development and benchmarking for general PR. We conclude with a discussion of PR's future directions, accompanied by a summary of the literature covered and access to our open-source library, available to the robotics community at: https://github.com/MetaSLAM/GPRS.
Multi-Visual-Inertial System: Analysis, Calibration and Estimation
Yang, Yulin, Geneva, Patrick, Huang, Guoquan
The combination of cameras and inertial measurement units (IMUs) has become prevalent in autonomous vehicles and mobile devices in the recent decade due to their decrease in cost and complementary sensing nature. A camera provides texture-rich images of 2 degree-of-freedom (DoF) bearing observations to environmental features, while a 6-axis IMU typically consists of a gyroscope and an accelerometer, which measure high-frequency angular velocity and linear acceleration, respectively. This has led to significant progress in developing visual-inertial navigation system (VINS) algorithms focusing on efficient and accurate pose estimation (Huang 2019). While many works have shown accurate estimation for the minimal sensing case of a single camera and IMU (Mourikis and Roumeliotis 2007; Bloesch et al. 2015; Forster et al. 2016; Qin et al. 2018; Geneva et al. 2020), it is known that the inclusion of additional sensors can provide improved accuracy due to additional information and robustness to single sensor failure cases (Paul et al. 2017).

Regarding state estimation, many works have explored the use of multiple vision sensors for better VINS performance (Leutenegger et al. 2015; Usenko et al. 2016; Paul et al. 2017; Sun et al. 2018; Kuo et al. 2020; Campos et al. 2021; Fu et al. 2021). In particular, Leutenegger et al. (2015), Usenko et al. (2016) and Fu et al. (2021) have shown that stereo or multiple cameras can achieve better pose accuracy or lower the uncertainties of IMU-Camera calibration. Only a few recent works investigate multiple inertial sensor fusion for VINS (Kim et al. 2017; Eckenhoff et al. 2019b; Zhang et al. 2020; Wu et al. 2023; Faizullin and Ferrer 2023), showing that system robustness and pose accuracy can be improved by fusing additional IMUs. For optimal fusion of multiple asynchronous visual and inertial sensors in MVIS, it is crucial to provide accurate full-parameter calibration for these sensors, which includes: (i) the IMU-IMU/camera rigid transformation, (ii) the IMU-IMU/camera time offset, (iii) …
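A small sketch of the first two calibration parameter groups listed above (rigid transformation and time offset with respect to a base IMU; the remaining items are truncated in the text and omitted here), together with how they would be applied to a sensor observation. The class and function names, and the sign convention for the time offset, are illustrative assumptions.

```python
# Per-sensor calibration parameters and their application (illustrative sketch).
import numpy as np
from dataclasses import dataclass, field

@dataclass
class SensorCalib:
    R_base_sensor: np.ndarray = field(default_factory=lambda: np.eye(3))    # (i) rotation to base IMU
    p_base_sensor: np.ndarray = field(default_factory=lambda: np.zeros(3))  # (i) translation to base IMU
    time_offset: float = 0.0                                                # (ii) sensor clock minus base clock

def to_base_frame(calib: SensorCalib, p_sensor: np.ndarray, t_sensor: float):
    """Express a 3-D point in the base-IMU frame and correct its timestamp."""
    p_base = calib.R_base_sensor @ p_sensor + calib.p_base_sensor
    t_base = t_sensor + calib.time_offset
    return p_base, t_base

cam0 = SensorCalib(time_offset=0.004)
p, t = to_base_frame(cam0, np.array([0.1, 0.0, 2.0]), t_sensor=12.345)
```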
MINS: Efficient and Robust Multisensor-aided Inertial Navigation System
Lee, Woosik, Geneva, Patrick, Chen, Chuchu, Huang, Guoquan
Robust multisensor fusion of multi-modal measurements such as IMUs, wheel encoders, cameras, LiDARs, and GPS holds great potential due to its innate ability to improve resilience to sensor failures and measurement outliers, thereby enabling robust autonomy. To the best of our knowledge, this work is among the first to develop a consistent tightly-coupled Multisensor-aided Inertial Navigation System (MINS) that is capable of fusing the most common navigation sensors in an efficient filtering framework, by addressing the particular challenges of computational complexity, sensor asynchronicity, and intra-sensor calibration. In particular, we propose a consistent high-order on-manifold interpolation scheme to enable efficient asynchronous sensor fusion, together with a state management strategy (i.e., dynamic cloning). The proposed dynamic cloning leverages motion-induced information to adaptively select interpolation orders to control computational complexity while minimizing trajectory representation errors. We perform online intrinsic and extrinsic (spatiotemporal) calibration of all onboard sensors to compensate for poor prior calibration and/or calibration that degrades over time. Additionally, we develop an initialization method that uses only the proprioceptive measurements of the IMU and wheel encoders, instead of exteroceptive sensors, which is shown to be less affected by the environment and more robust in highly dynamic scenarios. We extensively validate the proposed MINS in simulations and on large-scale challenging real-world datasets, outperforming the existing state-of-the-art methods in terms of localization accuracy, consistency, and computational efficiency. We have also open-sourced our algorithm, simulator, and evaluation toolbox for the benefit of the community: https://github.com/rpng/mins.
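For intuition, here is a sketch of on-manifold pose interpolation for an asynchronous measurement at time t between two cloned poses at t0 and t1. For brevity this is the first-order (linear) case; the paper's scheme extends to higher orders selected adaptively from the motion. The helper name is illustrative; scipy's Rotation handles the SO(3) Log/Exp maps.

```python
# First-order on-manifold pose interpolation sketch (the paper uses higher orders).
import numpy as np
from scipy.spatial.transform import Rotation as R

def interpolate_pose(t, t0, R0, p0, t1, R1, p1):
    lam = (t - t0) / (t1 - t0)                 # interpolation ratio in [0, 1]
    dR = (R0.inv() * R1).as_rotvec()           # Log of the relative rotation
    R_t = R0 * R.from_rotvec(lam * dR)         # Exp back onto SO(3)
    p_t = (1.0 - lam) * p0 + lam * p1          # linear interpolation of position
    return R_t, p_t

R_t, p_t = interpolate_pose(
    0.3, 0.0, R.identity(), np.zeros(3),
    1.0, R.from_euler("z", 30, degrees=True), np.array([1.0, 0.0, 0.0]))
```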
NeRF-VINS: A Real-time Neural Radiance Field Map-based Visual-Inertial Navigation System
Katragadda, Saimouli, Lee, Woosik, Peng, Yuxiang, Geneva, Patrick, Chen, Chuchu, Guo, Chao, Li, Mingyang, Huang, Guoquan
Achieving accurate, efficient, and consistent localization within an a priori environment map remains a fundamental challenge in robotics and computer vision. Conventional map-based keyframe localization often suffers from sub-optimal viewpoints due to the limited field of view (FOV), which degrades its performance. To address this issue, in this paper, we design a real-time tightly-coupled Neural Radiance Fields (NeRF)-aided visual-inertial navigation system (VINS), termed NeRF-VINS. By effectively leveraging NeRF's potential to synthesize novel views, essential for addressing limited viewpoints, the proposed NeRF-VINS optimally fuses IMU and monocular image measurements along with synthetically rendered images within an efficient filter-based framework. This tightly coupled integration enables 3D motion tracking with bounded error. We extensively compare the proposed NeRF-VINS against state-of-the-art methods that use prior map information and show that it achieves superior performance. We also demonstrate that the proposed method is able to perform real-time estimation at 15 Hz on a resource-constrained Jetson AGX Orin embedded platform with impressive accuracy.
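A rough sketch of the core idea: render a synthetic view from the prior map near the current pose estimate, match it against the live image, and treat the correspondences as map-anchored constraints for the filter. The render_from_nerf_map function is a hypothetical placeholder, and the descriptor choice and filter update here are not the paper's; this only illustrates the data flow.

```python
# Live-image vs. rendered-view matching sketch (placeholder renderer, illustrative only).
import cv2
import numpy as np

def render_from_nerf_map(pose_estimate):
    """Hypothetical: return an RGB view (and its known pose) rendered from the prior map."""
    return np.zeros((480, 640, 3), dtype=np.uint8), pose_estimate

def match_live_to_rendered(live_img, pose_estimate):
    rendered, render_pose = render_from_nerf_map(pose_estimate)
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(cv2.cvtColor(live_img, cv2.COLOR_BGR2GRAY), None)
    kp2, des2 = orb.detectAndCompute(cv2.cvtColor(rendered, cv2.COLOR_BGR2GRAY), None)
    if des1 is None or des2 is None:
        return []
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    # Each match links a live-image keypoint to a point anchored in the map frame
    # (whose rendering pose is known), providing drift-free constraints for the filter.
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in matches[:100]]

matches = match_live_to_rendered(
    (np.random.rand(480, 640, 3) * 255).astype(np.uint8), pose_estimate=np.eye(4))
```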