Efficient and Accurate Downfacing Visual Inertial Odometry
Jonas Kühne, Christian Vogt, Michele Magno, Luca Benini
arXiv.org Artificial Intelligence
This article has been accepted for publication in the IEEE Internet of Things Journal (IoT-J). Personal use of this material is permitted.

Abstract--Visual Inertial Odometry (VIO) is a widely used computer vision method that determines an agent's movement through a camera and an IMU sensor. This paper presents an efficient and accurate VIO pipeline optimized for applications on micro- and nano-UAVs. The proposed design incorporates state-of-the-art feature detection and tracking methods (SuperPoint, PX4FLOW, ORB), all optimized and quantized for emerging RISC-V-based ultra-low-power parallel systems-on-chips (SoCs). Furthermore, by employing a rigid-body motion model, the pipeline reduces estimation errors and achieves improved accuracy in planar motion scenarios. The pipeline's suitability for real-time VIO is assessed on an ultra-low-power SoC in terms of compute requirements and tracking accuracy after quantization. The pipeline, including the three feature tracking methods, was implemented on the SoC for real-world validation. This design bridges the gap between high-accuracy VIO pipelines that are traditionally run on computationally powerful systems and lightweight implementations suitable for microcontrollers. The optimized pipeline on the GAP9 low-power SoC demonstrates an average reduction in RMSE of up to a factor of 3.65x over the baseline pipeline when using the ORB feature tracker. The analysis of the computational complexity of the feature trackers further shows that PX4FLOW achieves on-par tracking accuracy with ORB at a lower runtime for movement speeds below 24 pixels/frame.

Visual Inertial Odometry (VIO) describes the process of determining an agent's movement through the use of camera and Inertial Measurement Unit (IMU) data [1]. Cameras are used in pure Visual Odometry (VO) to generate a movement estimate from one frame to another by considering the displacement of features or brightness patches between camera images [2].
While stereo VO (i.e., using two cameras) can estimate metric depth information through extrinsic …

This work was supported by the Swiss National Science Foundation's TinyTrainer project under Grant number 207913.
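The frame-to-frame principle described above — estimating motion from the displacement of matched features between consecutive images — can be illustrated with a minimal sketch. This is not the paper's pipeline: the feature coordinates below are synthetic stand-ins for what a detector/tracker such as ORB, PX4FLOW, or SuperPoint would produce, and the median is just one simple robust estimator for the planar case.

```python
# Toy illustration of the visual-odometry principle: recover image-plane
# motion from the displacement of matched features between two frames.
# Feature coordinates are synthetic; a real pipeline would obtain them
# from a feature detector/tracker (e.g., ORB, PX4FLOW, or SuperPoint).
import numpy as np

def estimate_displacement(prev_pts, curr_pts):
    """Estimate planar image motion as the median feature displacement.

    The median is a simple robust estimator that tolerates a few
    mismatched features (outliers), unlike the plain mean.
    """
    prev_pts = np.asarray(prev_pts, dtype=float)
    curr_pts = np.asarray(curr_pts, dtype=float)
    return np.median(curr_pts - prev_pts, axis=0)

# Synthetic example: the camera motion shifts features by (+3, -1)
# pixels; the last match is an outlier (a bad correspondence).
prev = [(10, 10), (50, 40), (80, 20), (30, 60)]
curr = [(13, 9), (53, 39), (83, 19), (90, 90)]
dx, dy = estimate_displacement(prev, curr)
print(dx, dy)  # ≈ 3.0 -1.0, unaffected by the single outlier
```

In a full VIO pipeline this per-frame image displacement would be fused with IMU measurements and, as in the paper's planar-motion case, constrained by a rigid-body motion model rather than taken directly as the pose update.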
Sep-15-2025