AITopics | Spatial Reasoning

Collaborating Authors

Spatial Reasoning

News Overviews Instructional Materials AI-Alerts Classics

BrainStratify: Coarse-to-Fine Disentanglement of Intracranial Neural Dynamics

Zheng, Hui, Wang, Hai-Teng, Jing, Yi-Tao, Lin, Pei-Yang, Zhao, Han-Qing, Chen, Wei, Wei, Peng-Hu, Shan, Yong-Zhi, Zhao, Guo-Guang, Liu, Yun-Zhe

arXiv.org Artificial IntelligenceMay-28-2025

Decoding speech directly from neural activity is a central goal in brain-computer interface (BCI) research. In recent years, exciting advances have been made through the growing use of intracranial field potential recordings, such as stereo-ElectroEncephaloGraphy (sEEG) and ElectroCorticoGraphy (ECoG). These neural signals capture rich population-level activity but present key challenges: (i) task-relevant neural signals are sparsely distributed across sEEG electrodes, and (ii) they are often entangled with task-irrelevant neural signals in both sEEG and ECoG. To address these challenges, we introduce a unified Coarse-to-Fine neural disentanglement framework, BrainStratify, which includes (i) identifying functional groups through spatial-context-guided temporal-spatial modeling, and (ii) disentangling distinct neural dynamics within the target functional group using Decoupled Product Quantization (DPQ). We evaluate BrainStratify on two open-source sEEG datasets and one (epidural) ECoG dataset, spanning tasks like vocal production and speech perception. Extensive experiments show that BrainStratify, as a unified framework for decoding speech from intracranial neural signals, significantly outperforms previous decoding methods. Overall, by combining data-driven stratification with neuroscience-inspired modularity, BrainStratify offers a robust and interpretable solution for speech decoding from intracranial recordings.

artificial intelligence, brainstra, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2505.2048

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (0.88)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science > Neuroscience (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.34)

Add feedback

SpatialLLM: From Multi-modality Data to Urban Spatial Intelligence

Chen, Jiabin, Wang, Haiping, Li, Jinpeng, Liu, Yuan, Dong, Zhen, Yang, Bisheng

arXiv.org Artificial IntelligenceMay-28-2025

We propose SpatialLLM, a novel approach advancing spatial intelligence tasks in complex urban scenes. Unlike previous methods requiring geographic analysis tools or domain expertise, SpatialLLM is a unified language model directly addressing various spatial intelligence tasks without any training, fine-tuning, or expert intervention. The core of SpatialLLM lies in constructing detailed and structured scene descriptions from raw spatial data to prompt pre-trained LLMs for scene-based analysis. Extensive experiments show that, with our designs, pretrained LLMs can accurately perceive spatial distribution information and enable zero-shot execution of advanced spatial intelligence tasks, including urban planning, ecological analysis, traffic management, etc. We argue that multi-field knowledge, context length, and reasoning ability are key factors influencing LLM performances in urban analysis. We hope that SpatialLLM will provide a novel viable perspective for urban intelligent analysis and man- agement.

information, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2505.12703

Genre: Research Report (1.00)

Industry:

Transportation > Infrastructure & Services (0.46)
Transportation > Ground > Road (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

Yuan, Yifu, Cui, Haiqin, Chen, Yibin, Dong, Zibin, Ni, Fei, Kou, Longxin, Liu, Jinyi, Li, Pengyi, Zheng, Yan, Hao, Jianye

arXiv.org Artificial IntelligenceMay-28-2025

Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while building on top of general Vision-Language Models (VLMs), still fall short of achieving robust zero-shot performance due to the scarcity and heterogeneity prevalent in embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validated FSD's capabilities in both "seeing" and "doing," achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on our proposed more challenging benchmark VABench. We also verified zero-shot capabilities in robot manipulation, demonstrating significant performance improvements over baseline methods in both SimplerEnv and real robot settings. Experimental results show that FSD achieves 40.6% success rate in SimplerEnv and 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.

arxiv preprint arxiv, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2505.08548

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.49)

Add feedback

SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models

Neural Information Processing SystemsMay-27-2025, 21:14:08 GMT

Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (i) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (ii) a flexible plugin'' module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances.

artificial intelligence, natural language, spatial reasoning, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (0.69)
Information Technology > Artificial Intelligence > Natural Language (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.45)

Add feedback

MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling

Neural Information Processing SystemsMay-27-2025, 20:34:06 GMT

Motion generation from discrete quantization offers many advantages over continuous regression, but at the cost of inevitable approximation errors. Previous methods usually quantize the entire body pose into one code, which not only faces the difficulty in encoding all joints within one vector but also loses the spatial relationship between different joints. Differently, in this work we quantize each individual joint into one vector, which i) simplifies the quantization process as the complexity associated with a single joint is markedly lower than that of the entire pose; ii) maintains a spatial-temporal structure that preserves both the spatial relationships among joints and the temporal movement patterns; iii) yields a 2D token map, which enables the application of various 2D operations widely used in 2D images. Grounded in the 2D motion quantization, we build a spatial-temporal modeling framework, where 2D joint VQVAE, temporal-spatial 2D masking technique, and spatial-temporal 2D attention are proposed to take advantage of spatial-temporal signals among the 2D tokens. Extensive experiments demonstrate that our method significantly outperforms previous methods across different datasets, with a 26.6\% decrease of FID on HumanML3D and a 29.9\% decrease on KIT-ML.

motion generation, spatial relationship, spatial-temporal joint modeling, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)

Add feedback

Diffusion4D: Fast Spatial-temporal Consistent 4D generation via Video Diffusion Models

Neural Information Processing SystemsMay-27-2025, 16:13:26 GMT

The availability of large-scale multimodal datasets and advancements in diffusion models have significantly accelerated progress in 4D content generation. Most prior approaches rely on multiple images or video diffusion models, utilizing score distillation sampling for optimization or generating pseudo novel views for direct supervision. However, these methods are hindered by slow optimization speeds and multi-view inconsistency issues. Spatial and temporal consistency in 4D geometry has been extensively explored respectively in 3D-aware diffusion models and traditional monocular video diffusion models. Building on this foundation, we propose a strategy to migrate the temporal consistency in video diffusion models to the spatial-temporal consistency required for 4D generation.

consistency, diffusion model, video diffusion model, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.64)

Add feedback

ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis

Neural Information Processing SystemsMay-27-2025, 11:28:12 GMT

Recently, token-based generation approaches have demonstrated their effectiveness in synthesizing visual content. As a representative example, non-autoregressive Transformers (NATs) can generate decent-quality images in just a few steps. NATs perform generation in a progressive manner, where the latent tokens of a resulting image are incrementally revealed step-by-step. At each step, the unrevealed image regions are padded with [MASK] tokens and inferred by NAT, with the most reliable predictions preserved as newly revealed, visible tokens. In this paper, we delve into understanding the mechanisms behind the effectiveness of NATs and uncover two important interaction patterns that naturally emerge from NAT's paradigm: Spatially (within a step), although [MASK] and visible tokens are processed uniformly by NATs, the interactions between them are highly asymmetric.

enat, rethinking spatial-temporal interaction, token-based image synthesis, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (0.40)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.40)

Add feedback

TorchSpatial: A Location Encoding Framework and Benchmark for Spatial Representation Learning

Neural Information Processing SystemsMay-27-2025, 09:19:11 GMT

Spatial representation learning (SRL) aims at learning general-purpose neural network representations from various types of spatial data (e.g., points, polylines, polygons, networks, images, etc.) in their native formats. Learning good spatial representations is a fundamental problem for various downstream applications such as species distribution modeling, weather forecasting, trajectory generation, geographic question answering, etc. Even though SRL has become the foundation of almost all geospatial artificial intelligence (GeoAI) research, we have not yet seen significant efforts to develop an extensive deep learning framework and benchmark to support SRL model development and evaluation. To fill this gap, we propose TorchSpatial, a learning framework and benchmark for location (point) encoding,which is one of the most fundamental data types of spatial representation learning. TorchSpatial contains three key components: 1) a unified location encoding framework that consolidates 15 commonly recognized location encoders, ensuring scalability and reproducibility of the implementations; 2) the LocBench benchmark tasks encompassing 7 geo-aware image classification and 10 geo-aware imageregression datasets; 3) a comprehensive suite of evaluation metrics to quantify geo-aware models' overall performance as well as their geographic bias, with a novel Geo-Bias Score metric.

location encoding framework and benchmark, spatial representation learning, torchspatial, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.60)

Add feedback

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

Neural Information Processing SystemsMay-27-2025, 08:02:04 GMT

Large language models (LLMs) and vision-language models (VLMs) have demonstrated remarkable performance across a wide range of tasks and domains. Despite this promise, spatial understanding and reasoning--a fundamental component of human cognition--remains under-explored. We propose SpatialEval, a novel benchmark that covers diverse aspects of spatial reasoning such as relationship understanding, navigation, and counting. We conduct a comprehensive evaluation of competitive language and vision-language models. Our findings reveal several counter-intuitive insights that have been overlooked in the literature: (1) Spatial reasoning poses significant challenges where competitive models can fall behind random guessing; (2) Despite additional visual input, VLMs often under-perform compared to their LLM counterparts; (3) When both textual and visual information is available, multi-modal language models become less reliant on visual information if sufficient textual clues are provided.

picture worth, spatial reasoning, vision language model, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.89)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)

Add feedback

SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors

Neural Information Processing SystemsMay-27-2025, 06:37:50 GMT

Current state-of-the-art spatial reasoning-enhanced VLMs are trained to excel at spatial visual question answering (VQA). However, we believe that higher-level 3D-aware tasks, such as articulating dynamic scene changes and motion planning, require a fundamental and explicit 3D understanding beyond current spatial VQA datasets. In this work, we present SpatialPIN, a framework designed to enhance the spatial reasoning capabilities of VLMs through prompting and interacting with priors from multiple 3D foundation models in a zero-shot, training-free manner. Extensive experiments demonstrate that our spatial reasoning-imbued VLM performs well on various forms of spatial VQA and can extend to help in various downstream robotics tasks such as pick and stack and trajectory planning.

enhancing spatial reasoning capability, spatialpin, vision-language model

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback