AITopics | monocular

4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos

Neural Information Processing Systems | Jun-11-2026, 19:35:02 GMT

We propose 4DGT, a 4D Gaussian-based Transformer model for dynamic scene reconstruction, trained entirely on real-world monocular posed videos. Using 4D Gaussians as an inductive bias, 4DGT unifies static and dynamic components, enabling the modeling of complex, time-varying environments with varying object lifespans. We propose a novel density control strategy during training, which enables 4DGT to handle longer space-time inputs. Our model processes 64 consecutive posed frames in a rolling-window fashion, predicting consistent 4D Gaussians in the scene. Unlike optimization-based methods, 4DGT performs purely feed-forward inference, reducing reconstruction time from hours to seconds and scaling effectively to long video sequences. Trained only on large-scale monocular posed video datasets, 4DGT can significantly outperform prior Gaussian-based networks on real-world videos and achieve accuracy on par with optimization-based methods on cross-domain videos.
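
The rolling-window processing mentioned in the abstract can be illustrated with a minimal sketch. Only the window length of 64 frames comes from the abstract; the function name `rolling_windows`, the `stride` parameter, and its default of 32 are hypothetical, since the abstract does not say how much consecutive windows overlap.

```python
from typing import Iterator, List


def rolling_windows(
    num_frames: int, window: int = 64, stride: int = 32
) -> Iterator[List[int]]:
    """Yield lists of consecutive frame indices, one list per window.

    `window=64` follows the abstract; `stride` is an assumption made
    for illustration only.
    """
    if num_frames <= 0 or window <= 0 or stride <= 0:
        raise ValueError("num_frames, window, and stride must be positive")
    start = 0
    while True:
        # Clamp the last window so it never runs past the end of the video.
        end = min(start + window, num_frames)
        yield list(range(start, end))
        if end == num_frames:
            break
        start += stride


if __name__ == "__main__":
    for frames in rolling_windows(100):
        print(frames[0], frames[-1], len(frames))
```

Each yielded index list would be handed to the feed-forward model in turn; the final window may be shorter than 64 frames when the video length is not aligned with the stride.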

artificial intelligence, name change, proceedings, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (0.41)

370fa2e691f57eb319bc263a07dad4a5-Supplemental-Conference.pdf

Neural Information Processing Systems | Feb-10-2026, 04:04:19 GMT

category, detection performance, sun rgb-d dataset, (10 more...)

Neural Information Processing Systems

Country: North America > United States > California > Santa Clara County > Palo Alto (0.05)

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

89fcd07f20b6785b92134bd6c1d0fa42-Paper.pdf

Neural Information Processing Systems | Feb-9-2026, 18:26:11 GMT

arxiv preprint arxiv, dataset, slam system, (13 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras

Neural Information Processing Systems | Dec-24-2025, 10:33:41 GMT

We introduce DROID-SLAM, a new deep learning based SLAM system. DROID-SLAM consists of recurrent iterative updates of camera pose and pixelwise depth through a Dense Bundle Adjustment layer. DROID-SLAM is accurate, achieving large improvements over prior work, and robust, suffering from substantially fewer catastrophic failures. Despite training on monocular video, it can leverage stereo or RGB-D video to achieve improved performance at test time.

deep visual slam, droid-slam, name change, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.63)

370fa2e691f57eb319bc263a07dad4a5-Supplemental-Conference.pdf

Neural Information Processing Systems | Oct-8-2025, 11:02:52 GMT

category, detection performance, sun rgb-d dataset, (10 more...)

Neural Information Processing Systems

Country: North America > United States > California > Santa Clara County > Palo Alto (0.05)

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

VROOM - Visual Reconstruction over Onboard Multiview

Yadav, Yajat, Bharadwaj, Varun, Korrapati, Jathin, Baranwal, Tanish

arXiv.org Artificial Intelligence | Aug-26-2025

We introduce VROOM, a system for reconstructing 3D models of Formula 1 circuits using only onboard camera footage from racecars. Leveraging video data from the 2023 Monaco Grand Prix, we address video challenges such as high-speed motion and sharp cuts in camera frames. Our pipeline analyzes different methods such as DROID-SLAM, AnyCam, and Monst3r and combines preprocessing techniques such as different methods of masking, temporal chunking, and resolution scaling to account for dynamic motion and computational constraints. We show that VROOM is able to partially recover track and vehicle trajectories in complex environments. These findings indicate the feasibility of using onboard video for scalable 4D reconstruction in real-world settings.
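
One way to picture temporal chunking in the presence of sharp cuts is the sketch below. It is an illustration under stated assumptions, not the VROOM pipeline: the function `chunk_frames`, the idea that cut positions are already known as frame indices, and the `max_len` cap (standing in for a compute budget) are all hypothetical.

```python
from typing import Iterable, List, Tuple


def chunk_frames(
    num_frames: int, cut_indices: Iterable[int], max_len: int
) -> List[Tuple[int, int]]:
    """Split frames [0, num_frames) into half-open (start, end) chunks.

    A chunk never spans a cut, and no chunk is longer than `max_len`.
    `cut_indices` holds the index of the first frame after each cut
    (assumed to come from some upstream cut detector).
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    # Keep only cuts that fall strictly inside the video.
    boundaries = sorted({c for c in cut_indices if 0 < c < num_frames})
    edges = [0] + boundaries + [num_frames]
    chunks: List[Tuple[int, int]] = []
    for seg_start, seg_end in zip(edges, edges[1:]):
        # Subdivide each cut-free segment to respect the length cap.
        for start in range(seg_start, seg_end, max_len):
            chunks.append((start, min(start + max_len, seg_end)))
    return chunks


if __name__ == "__main__":
    print(chunk_frames(10, [4], 3))
```

Each resulting chunk could then be passed to a reconstruction method independently, which keeps any single run within memory limits and avoids tracking across a camera cut.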

artificial intelligence, reconstruction, video, (16 more...)

arXiv.org Artificial Intelligence

2508.17172

Country: Europe > Monaco (0.26)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Sports > Motorsports > Formula One (0.69)

Technology: Information Technology > Artificial Intelligence (1.00)

89fcd07f20b6785b92134bd6c1d0fa42-Paper.pdf

Neural Information Processing Systems | Aug-15-2025, 18:11:00 GMT

artificial intelligence, deep learning, machine learning, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

MonoSLAM: Robust Monocular SLAM with Global Structure Optimization

Jiang, Bingzheng, Wang, Jiayuan, Ding, Han, Zhu, Lijun

arXiv.org Artificial Intelligence | Mar-12-2025

This paper presents a robust monocular visual SLAM system that simultaneously utilizes point, line, and vanishing point features for accurate camera pose estimation and mapping. To address the critical challenge of achieving reliable localization in low-texture environments, where traditional point-based systems often fail due to insufficient visual features, we introduce a novel approach that leverages the structural information of Global Primitives to improve the system's robustness and accuracy. Our key innovation lies in constructing vanishing points from line features and proposing a weighted fusion strategy to build Global Primitives in the world coordinate system. This strategy associates multiple frames with non-overlapping regions and formulates a multi-frame reprojection error optimization, significantly improving tracking accuracy in texture-scarce scenarios. Evaluations on various datasets show that our system outperforms state-of-the-art methods in trajectory precision, particularly in challenging environments.
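
As background for how a vanishing point can be built from line features, the sketch below uses standard projective geometry: in homogeneous image coordinates, the line through two points is their cross product, and the intersection of two lines is again a cross product. This is a generic two-line construction, not the paper's weighted fusion strategy; the helper names `line_through` and `vanishing_point` and the tolerance `eps` are hypothetical.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]
Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def line_through(p: Point, q: Point) -> Vec3:
    """Homogeneous line through two image points."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(
    seg_a: Tuple[Point, Point],
    seg_b: Tuple[Point, Point],
    eps: float = 1e-9,
) -> Optional[Point]:
    """Intersect the lines supporting two image segments.

    Returns None when the lines are parallel (or identical) in the
    image, i.e. the intersection lies at infinity or is undefined.
    """
    v = cross(line_through(*seg_a), line_through(*seg_b))
    if abs(v[2]) < eps:
        return None
    return (v[0] / v[2], v[1] / v[2])


if __name__ == "__main__":
    print(vanishing_point(((0.0, 0.0), (1.0, 1.0)), ((0.0, 4.0), (1.0, 3.0))))
```

With more than two segments from the same 3D direction, a practical system would combine many such intersections, for example by least squares or a weighting scheme, rather than rely on a single pair.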

denote, line feature, optimization, (14 more...)

arXiv.org Artificial Intelligence

2503.09296

Country:

Asia > China > Hubei Province > Wuhan (0.05)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)

Genre: Research Report > Promising Solution (0.54)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.34)

DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras

Neural Information Processing Systems | Jan-14-2025, 15:58:30 GMT

We introduce DROID-SLAM, a new deep learning based SLAM system. DROID-SLAM consists of recurrent iterative updates of camera pose and pixelwise depth through a Dense Bundle Adjustment layer. DROID-SLAM is accurate, achieving large improvements over prior work, and robust, suffering from substantially fewer catastrophic failures. Despite training on monocular video, it can leverage stereo or RGB-D video to achieve improved performance at test time.

deep visual slam, droid-slam, rgb-d camera, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.74)

VINGS-Mono: Visual-Inertial Gaussian Splatting Monocular SLAM in Large Scenes

Wu, Ke, Zhang, Zicheng, Tie, Muer, Ai, Ziqing, Gan, Zhongxue, Ding, Wenchao

arXiv.org Artificial Intelligence | Jan-14-2025

VINGS-Mono is a monocular (inertial) Gaussian Splatting (GS) SLAM framework designed for large scenes. The framework comprises four main components: VIO Front End, 2D Gaussian Map, NVS Loop Closure, and Dynamic Eraser. In the VIO Front End, RGB frames are processed through dense bundle adjustment and uncertainty estimation to extract scene geometry and poses. Based on this output, the mapping module incrementally constructs and maintains a 2D Gaussian map. Key components of the 2D Gaussian Map include a Sample-based Rasterizer, Score Manager, and Pose Refinement, which collectively improve mapping speed and localization accuracy. This enables the SLAM system to handle large-scale urban environments with up to 50 million Gaussian ellipsoids. To ensure global consistency in large-scale scenes, we design a Loop Closure module, which innovatively leverages the Novel View Synthesis (NVS) capabilities of Gaussian Splatting for loop closure detection and correction of the Gaussian map. Additionally, we propose a Dynamic Eraser to address the inevitable presence of dynamic objects in real-world outdoor scenes. Extensive evaluations in indoor and outdoor environments demonstrate that our approach achieves localization performance on par with Visual-Inertial Odometry while surpassing recent GS/NeRF SLAM methods. It also significantly outperforms all existing methods in terms of mapping and rendering quality. Furthermore, we developed a mobile app and verified that our framework can generate high-quality Gaussian maps in real time using only a smartphone camera and a low-frequency IMU sensor. To the best of our knowledge, VINGS-Mono is the first monocular Gaussian SLAM method capable of operating in outdoor environments and supporting kilometer-scale large scenes.

artificial intelligence, gaussian, optimization problem, (17 more...)

arXiv.org Artificial Intelligence

2501.08286

Country: