AITopics

3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. However, MLLMs primarily process 2D visual inputs and struggle with understanding 3D spatial structure of scenes solely from these limited perspectives. Existing methods mainly utilize viewpoint-dependent rendering of reconstructed point clouds to provide explicit structural guidance for MLLMs in 3DVG tasks, leading to inefficiency and limited spatial reasoning. To address this issue, we propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning. We introduce a spatial guidance strategy that leverages the structure awareness of feed-forward 3D reconstruction. By acquiring 3D structural understanding during training, our model can implicitly reason about 3D scenes without relying on inefficient point cloud reconstruction. Moreover, we propose a structure-enhanced module (SE), which first employs intra-view and inter-view attention mechanisms to capture dependencies within views and correspondences across views. The module further integrates multi-level position encoding to associate visual representations with spatial positions and viewpoint information, enabling more accurate structural understanding. Extensive experiments demonstrate that S$^2$-MLLM unifies superior performance, generalization, and efficiency, achieving significant performance over existing methods across the ScanRefer, Nr3D, and Sr3D datasets. Code will be available upon acceptance.

artificial intelligence, natural language, visual grounding, (15 more...)

2512.01223

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

Ni, Chaojun, Chen, Cheng, Wang, Xiaofeng, Zhu, Zheng, Zheng, Wenzhao, Wang, Boyuan, Chen, Tianrun, Zhao, Guosheng, Li, Haoyun, Dong, Zhehao, Zhang, Qiang, Ye, Yun, Wang, Yang, Huang, Guan, Mei, Wenjun

Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. Then, to enhance the VLM's ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that masks 4D inputs to the VLM and trains the VLA to reconstruct them, enabling the VLM to learn effective 4D representations and allowing the 4D branch to be dropped at inference with minimal performance loss. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger, achieving comparable performance on edge devices while being 18 times faster and reducing memory footprint by 12 times.

arxiv preprint arxiv, machine learning, natural language, (19 more...)

2512.00903

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
(2 more...)

Thompson, Jacob, Garcia-Lopez, Emiliano, Bisk, Yonatan

REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories

Humans build viewpoint-independent cognitive maps through navigation, enabling intuitive reasoning about object permanence and spatial relations. We argue that multimodal large language models (MLLMs), despite extensive video training, lack this fundamental spatial reasoning capability, a critical limitation for embodied applications. To demonstrate these limitations and drive research, we introduce REM (Reasoning over Embodied Multi-Frame Trajectories), a benchmark using controllable 3D environments for long-horizon embodied spatial reasoning. REM systematically evaluates key aspects like object permanence/distinction, spatial relationships, and numerical tracking across dynamic embodied viewpoints. Our evaluation shows that the best-performing current models exhibit promising overall performance, but become increasingly unreliable at even moderate complexity levels easily handled by humans. These findings highlight challenges MLLMs face in developing robust spatial representations from sequential visual input. Consequently, REM provides targeted metrics and diagnostics to foster improved spatial understanding in future models.

large language model, machine learning, trajectory, (18 more...)

2512.00736

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

Qin, Yiming, Wei, Bomin, Ge, Jiaxin, Kallidromitis, Konstantinos, Fu, Stephanie, Darrell, Trevor, Wang, XuDong

Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens-compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16% and demonstrates that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.

large language model, machine learning, natural language, (21 more...)

2511.19418

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Erzibengoa, Julen, Gómez-Omella, Meritxell, Goienetxea, Izaro

IberFire -- a detailed creation of a spatio-temporal dataset for wildfire risk assessment in Spain

Wildfires pose a threat to ecosystems, economies and public safety, particularly in Mediterranean regions such as Spain. Accurate predictive models require high-resolution spatio-temporal data to capture complex dynamics of environmental and human factors. To address the scarcity of fine-grained wildfire datasets in Spain, we introduce IberFire: a spatio-temporal dataset with 1 km x 1 km x 1-day resolution, covering mainland Spain and the Balearic Islands from December 2007 to December 2024. IberFire integrates 120 features across eight categories: auxiliary data, fire history, geography, topography, meteorology, vegetation indices, human activity and land cover. All features and processing rely on open-access data and tools, with a publicly available codebase ensuring transparency and applicability. IberFire offers enhanced spatial granularity and feature diversity compared to existing European datasets, and provides a reproducible framework. It supports advanced wildfire risk modelling via Machine Learning and Deep Learning, facilitates climate trend analysis, and informs fire prevention and land management strategies. The dataset is freely available on Zenodo to promote open research and collaboration.

artificial intelligence, datacube, machine learning, (18 more...)

2505.00837

Country: Europe > Spain > Balearic Islands (0.24)

Genre: Research Report (1.00)

Industry:

Government (0.68)
Law Enforcement & Public Safety (0.49)
Food & Agriculture > Agriculture (0.48)
(2 more...)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Software (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Geometrically-Constrained Agent for Spatial Reasoning

Chen, Zeren, Lu, Xiaoya, Zheng, Zhijie, Li, Pengrui, He, Lehan, Zhou, Yijin, Shao, Jing, Zhuang, Bohan, Sheng, Lu

Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference but their reasoning operates within a lossy semantic space, misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an ``oracle paradox,'' learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but critically leave the VLM's planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we strategically decouples the VLM's role into two stages. First, acting as a semantic analyst, the VLM translates the user's ambiguous query into the formal, verifiable task constraint, which defines the reference frame and objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by the constraint. This geometrically-constrained reasoning strategy successfully resolve the semantic-to-geometric gap, yielding a robust and verifiable reasoning pathway for spatial reasoning. Comprehensive experiments demonstrate that GCA achieves SOTA performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by ~27%. Please see our homepage at https://gca-spatial-reasoning.github.io.

artificial intelligence, machine learning, spatial reasoning, (18 more...)

2511.22659

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory

Wang, Bo, Lin, Jiehong, Liu, Chenzhi, Hu, Xinting, Yu, Yifei, Liu, Tianjia, Wang, Zhongrui, Qi, Xiaojuan

We present MG-Nav (Memory-Guided Navigation), a dual-scale framework for zero-shot visual navigation that unifies global memory-guided planning with local geometry-enhanced control. At its core is the Sparse Spatial Memory Graph (SMG), a compact, region-centric memory where each node aggregates multi-view keyframe and object semantics, capturing both appearance and spatial structure while preserving viewpoint diversity. At the global level, the agent is localized on SMG and a goal-conditioned node path is planned via an image-to-instance hybrid retrieval, producing a sequence of reachable waypoints for long-horizon guidance. At the local level, a navigation foundation policy executes these waypoints in point-goal mode with obstacle-aware control, and switches to image-goal mode when navigating from the final node towards the visual target. To further enhance viewpoint alignment and goal recognition, we introduce VGGT-adapter, a lightweight geometric module built on the pre-trained VGGT model, which aligns observation and goal features in a shared 3D-aware space. MG-Nav operates global planning and local control at different frequencies, using periodic re-localization to correct errors. Experiments on HM3D Instance-Image-Goal and MP3D Image-Goal benchmarks demonstrate that MG-Nav achieves state-of-the-art zero-shot performance and remains robust under dynamic rearrangements and unseen scene conditions.

large language model, machine learning, natural language, (16 more...)

2511.22609

Genre: Workflow (0.94)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.56)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.46)

GazeTrack: High-Precision Eye Tracking Based on Regularization and Spatial Computing

Yang, Xiaoyin

Eye tracking has become increasingly important in virtual and augmented reality applications; however, the current gaze accuracy falls short of meeting the requirements for spatial computing. We designed a gaze collection framework and utilized high-precision equipment to gather the first precise benchmark dataset, GazeTrack, encompassing diverse ethnicities, ages, and visual acuity conditions for pupil localization and gaze tracking. We propose a novel shape error regularization method to constrain pupil ellipse fitting and train on open-source datasets, enhancing semantic segmentation and pupil position prediction accuracy. Additionally, we invent a novel coordinate transformation method similar to paper unfolding to accurately predict gaze vectors on the GazeTrack dataset. Finally, we built a gaze vector generation model that achieves reduced gaze angle error with lower computational complexity compared to other methods.Please refer to our project page for more details: https://github.com/---(please

artificial intelligence, machine learning, spatial reasoning, (21 more...)

2511.22607

Country: Europe (1.00)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area (0.49)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.61)
Information Technology > Human Computer Interaction > Interfaces > Virtual Reality (0.49)

Mohammadshirazi, Ahmad, Neogi, Pinaki Prasad Guha, Kulshrestha, Dheeraj, Ramnath, Rajiv

DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA

Document visual question answering (DocVQA) requires models to jointly reason over textual content and spatial layout, yet current systems exhibit a sharp accuracy--efficiency trade-off: large teacher models achieve strong grounding but are too expensive for deployment, while compact students suffer substantial drops in localization performance. We propose DocVAL, a validated chain-of-thought distillation framework that transfers the spatial reasoning ability of a large teacher into a deployable student VLM through three key components: (1) teacher supervision with validation-time text detection to filter and denoise training signals, (2) a multi-module validator (VAL) that enforces answer correctness and geometric consistency while producing fine-grained, pixel-level error feedback, and (3) a two-stage student training scheme that first learns from validated CoT traces and then undergoes iterative refinement driven by VAL feedback. Our student (Gemma-3 12B) achieves 91.4\% ANLS and 82.4\% mAP on DocVQA as a pure VLM requiring no text detection or OCR at inference. Extensive ablations demonstrate that validated feedback contributes 6.3 mAP gain and iterative refinement accounts for 9.7 mAP improvement. We release 95k high-quality, validator-verified CoT traces to advance spatial reasoning research in document understanding.

large language model, machine learning, natural language, (19 more...)

2511.22521

Country: North America > United States > Ohio (0.15)

Genre: Research Report (0.50)

Industry: Education (0.89)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.91)

Ofuchi, Ryosuke, Toda, Yuichiro, Masuyama, Naoki, Matsuno, Takayuki

MLATC: Fast Hierarchical Topological Mapping from 3D LiDAR Point Clouds Based on Adaptive Resonance Theory

This paper addresses the problem of building global topological maps from 3D LiDAR point clouds for autonomous mobile robots operating in large-scale, dynamic, and unknown environments. Adaptive Resonance Theory-based Topological Clustering with Different Topologies (ATC-DT) builds global topological maps represented as graphs while mitigating catastrophic forgetting during sequential processing. However, its winner selection mechanism relies on an exhaustive nearest-neighbor search over all existing nodes, leading to scalability limitations as the map grows. To address this challenge, we propose a hierarchical extension called Multi-Layer ATC (MLATC). MLATC organizes nodes into a hierarchy, enabling the nearest-neighbor search to proceed from coarse to fine resolutions, thereby drastically reducing the number of distance evaluations per query. The number of layers is not fixed in advance. MLATC employs an adaptive layer addition mechanism that automatically deepens the hierarchy when lower layers become saturated, keeping the number of user-defined hyperparameters low. Simulation experiments on synthetic large-scale environments show that MLATC accelerates topological map building compared to the original ATC-DT and exhibits a sublinear, approximately logarithmic scaling of search time with respect to the number of nodes. Experiments on campus-scale real-world LiDAR datasets confirm that MLATC maintains a millisecond-level per-frame runtime and enables real-time global topological map building in large-scale environments, significantly outperforming the original ATC-DT in terms of computational efficiency.

artificial intelligence, node, spatial reasoning, (17 more...)

2511.22238

Country:

Europe (0.67)
Asia > Japan (0.28)

Genre: Research Report (1.00)

Industry: Information Technology > Services (0.40)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.93)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.68)