Sui, Wei
GeoFlow-SLAM: A Robust Tightly-Coupled RGBD-Inertial Fusion SLAM for Dynamic Legged Robotics
Xiao, Tingyang, Zhou, Xiaolin, Liu, Liu, Sui, Wei, Feng, Wei, Qiu, Jiaxiong, Wang, Xinjie, Su, Zhizhong
This paper presents GeoFlow-SLAM, a robust and effective Tightly-Coupled RGBD-inertial SLAM for legged robots operating in highly dynamic environments.By integrating geometric consistency, legged odometry constraints, and dual-stream optical flow (GeoFlow), our method addresses three critical challenges:feature matching and pose initialization failures during fast locomotion and visual feature scarcity in texture-less scenes.Specifically, in rapid motion scenarios, feature matching is notably enhanced by leveraging dual-stream optical flow, which combines prior map points and poses. Additionally, we propose a robust pose initialization method for fast locomotion and IMU error in legged robots, integrating IMU/Legged odometry, inter-frame Perspective-n-Point (PnP), and Generalized Iterative Closest Point (GICP). Furthermore, a novel optimization framework that tightly couples depth-to-map and GICP geometric constraints is first introduced to improve the robustness and accuracy in long-duration, visually texture-less environments. The proposed algorithms achieve state-of-the-art (SOTA) on collected legged robots and open-source datasets. To further promote research and development, the open-source datasets and code will be made publicly available at https://github.com/NSN-Hello/GeoFlow-SLAM
A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space
He, Yonghao, Su, Hu, Yu, Haiyong, Yang, Cong, Sui, Wei, Wang, Cong, Liu, Song
Open-set object detection (OSOD) is highly desirable for robotic manipulation in unstructured environments. However, existing OSOD methods often fail to meet the requirements of robotic applications due to their high computational burden and complex deployment. To address this issue, this paper proposes a light-weight framework called Decoupled OSOD (DOSOD), which is a practical and highly efficient solution to support real-time OSOD tasks in robotic systems. Specifically, DOSOD builds upon the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor is developed to transform text embeddings extracted by the VLM into a joint space, within which the detector learns the region representations of class-agnostic proposals. Cross-modality features are directly aligned in the joint space, avoiding the complex feature interactions and thereby improving computational efficiency. DOSOD operates like a traditional closed-set detector during the testing phase, effectively bridging the gap between closed-set and open-set detection. Compared to the baseline YOLO-World, the proposed DOSOD significantly enhances real-time performance while maintaining comparable accuracy. The slight DOSOD-S model achieves a Fixed AP of $26.7\%$, compared to $26.2\%$ for YOLO-World-v1-S and $22.7\%$ for YOLO-World-v2-S, using similar backbones on the LVIS minival dataset. Meanwhile, the FPS of DOSOD-S is $57.1\%$ higher than YOLO-World-v1-S and $29.6\%$ higher than YOLO-World-v2-S. Meanwhile, we demonstrate that the DOSOD model facilitates the deployment of edge devices. The codes and models are publicly available at https://github.com/D-Robotics-AI-Lab/DOSOD.
Gyroscope-Assisted Motion Deblurring Network
Luan, Simin, Yang, Cong, Boukhers, Zeyd, Qin, Xue, Cheng, Dongfeng, Sui, Wei, Li, Zhijun
Image research has shown substantial attention in deblurring networks in recent years. Yet, their practical usage in real-world deblurring, especially motion blur, remains limited due to the lack of pixel-aligned training triplets (background, blurred image, and blur heat map) and restricted information inherent in blurred images. This paper presents a simple yet efficient framework to synthetic and restore motion blur images using Inertial Measurement Unit (IMU) data. Notably, the framework includes a strategy for training triplet generation, and a Gyroscope-Aided Motion Deblurring (GAMD) network for blurred image restoration. The rationale is that through harnessing IMU data, we can determine the transformation of the camera pose during the image exposure phase, facilitating the deduction of the motion trajectory (aka. blur trajectory) for each point inside the three-dimensional space. Thus, the synthetic triplets using our strategy are inherently close to natural motion blur, strictly pixel-aligned, and mass-producible. Through comprehensive experiments, we demonstrate the advantages of the proposed framework: only two-pixel errors between our synthetic and real-world blur trajectories, a marked improvement (around 33.17%) of the state-of-the-art deblurring method MIMO on Peak Signal-to-Noise Ratio (PSNR).
Towards Accurate Ground Plane Normal Estimation from Ego-Motion
Zhang, Jiaxin, Sui, Wei, Zhang, Qian, Chen, Tao, Yang, Cong
In this paper, we introduce a novel approach for ground plane normal estimation of wheeled vehicles. In practice, the ground plane is dynamically changed due to braking and unstable road surface. As a result, the vehicle pose, especially the pitch angle, is oscillating from subtle to obvious. Thus, estimating ground plane normal is meaningful since it can be encoded to improve the robustness of various autonomous driving tasks (e.g., 3D object detection, road surface reconstruction, and trajectory planning). Our proposed method only uses odometry as input and estimates accurate ground plane normal vectors in real time. Particularly, it fully utilizes the underlying connection between the ego pose odometry (ego-motion) and its nearby ground plane. Built on that, an Invariant Extended Kalman Filter (IEKF) is designed to estimate the normal vector in the sensor's coordinate. Thus, our proposed method is simple yet efficient and supports both camera- and inertial-based odometry algorithms. Its usability and the marked improvement of robustness are validated through multiple experiments on public datasets. For instance, we achieve state-of-the-art accuracy on KITTI dataset with the estimated vector error of 0.39{\deg}. Our code is available at github.com/manymuch/ground_normal_filter.
Monocular Road Planar Parallax Estimation
Yuan, Haobo, Chen, Teng, Sui, Wei, Xie, Jiafeng, Zhang, Lefei, Li, Yuan, Zhang, Qian
Estimating the 3D structure of the drivable surface and surrounding environment is a crucial task for assisted and autonomous driving. It is commonly solved either by using expensive 3D sensors such as LiDAR or directly predicting the depth of points via deep learning. Instead of following existing methodologies, we propose Road Planar Parallax Attention Network (RPANet), a new deep neural network for 3D sensing from monocular image sequences based on planar parallax, which takes full advantage of the commonly seen road plane geometry in driving scenes. RPANet takes a pair of images aligned by the homography of the road plane as input and outputs a $\gamma$ map for 3D reconstruction. Beyond estimating the depth or height, the $\gamma$ map has a potential to construct a two-dimensional transformation between two consecutive frames while can be easily derived to depth or height. By warping the consecutive frames using the road plane as a reference, the 3D structure can be estimated from the planar parallax and the residual image displacements. Furthermore, to make the network better perceive the displacements caused by planar parallax, we introduce a novel cross-attention module. We sample data from the Waymo Open Dataset and construct data related to planar parallax. Comprehensive experiments are conducted on the sampled dataset to demonstrate the 3D reconstruction accuracy of our approach in challenging scenarios.