SwiftVGGT: A Scalable Visual Geometry Grounded Transformer for Large-Scale Scenes
Jungho Lee, Minhyeok Lee, Sunghun Yang, Minseok Kang, Sangyoun Lee
–arXiv.org Artificial Intelligence
3D reconstruction of large-scale scenes is a fundamental task in 3D perception, but the inherent trade-off between accuracy and computational efficiency remains a significant challenge. Existing methods either prioritize speed and produce low-quality results, or achieve high-quality reconstruction at the cost of slow inference. In this paper, we propose SwiftVGGT, a training-free method that significantly reduces inference time while preserving high-quality dense 3D reconstruction. To maintain global consistency in large-scale scenes, SwiftVGGT performs loop closure without relying on an external Visual Place Recognition (VPR) model. This removes redundant computation and enables accurate reconstruction over kilometer-scale environments. Furthermore, we propose a simple yet effective point sampling method that aligns neighboring chunks using a single Sim(3)-based Singular Value Decomposition (SVD) step. This eliminates the need for the Iteratively Reweighted Least Squares (IRLS) optimization commonly used in prior work, leading to substantial speed-ups. We evaluate SwiftVGGT on multiple datasets and show that it achieves state-of-the-art reconstruction quality while requiring only 33% of the inference time of recent VGGT-based large-scale reconstruction approaches.
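The single-SVD Sim(3) alignment mentioned in the abstract can be illustrated with the classic Umeyama closed-form solution, which recovers scale, rotation, and translation between two corresponding point sets in one SVD. The sketch below is a generic implementation of that technique, not the paper's actual code; the function name and the assumption of clean one-to-one correspondences are illustrative.

```python
import numpy as np

def sim3_umeyama(src, dst):
    """Estimate a Sim(3) transform (scale s, rotation R, translation t)
    mapping src points onto dst via a single SVD (Umeyama's method).
    src, dst: (N, 3) arrays of corresponding 3D points."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d          # centered point clouds
    cov = xd.T @ xs / len(src)               # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                         # guard against reflections
    R = U @ S @ Vt                           # optimal rotation
    var_s = (xs ** 2).sum() / len(src)       # source variance
    s = np.trace(np.diag(D) @ S) / var_s     # optimal uniform scale
    t = mu_d - s * R @ mu_s                  # optimal translation
    return s, R, t
```

Because the solution is closed-form, aligning two chunks costs one 3x3 SVD instead of the repeated re-weighting passes an IRLS solver performs, which is consistent with the speed-up the abstract attributes to this step.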
Nov-25-2025
- Country:
- Asia
- China > Guangxi Province
- Nanning (0.04)
- Japan > Honshū
- Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- North America > United States (0.14)
- Genre:
- Research Report > New Finding (0.93)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Representation & Reasoning (0.68)
- Robots (0.68)
- Vision (1.00)