AITopics | Spatial Reasoning

Spatial graphs are graphs whose nodes are localized in space (e.g., public transport networks, molecules, branching biological structures). In this work, we consider the problem of spatial graph reduction, which aims to find a smaller spatial graph (i.e., with fewer nodes) that has the same overall structure as the initial one. In this context, performing the graph reduction while preserving the main topological features of the initial graph is particularly relevant, due to the additional spatial information. Thus, we propose a topological spatial graph coarsening approach based on a new framework that finds a trade-off between graph reduction and the preservation of topological characteristics. The coarsening is realized by collapsing short edges. To capture the topological information required to calibrate the reduction level, we adapt the construction of classical topological descriptors made for point clouds (the so-called persistence diagrams) to spatial graphs. This construction relies on the introduction of a new filtration called triangle-aware graph filtration. Our coarsening approach is parameter-free, and we prove that it is equivariant under rotations, translations and scaling of the initial spatial graph. We evaluate the performance of our method on synthetic and real spatial graphs, and show that it significantly reduces graph sizes while preserving the relevant topological information.
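To make the "collapsing short edges" step in the abstract above concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract describes a parameter-free method calibrated with persistence diagrams, whereas this illustration assumes a hypothetical fixed `threshold`, merges the two endpoints at their midpoint, and uses a hypothetical function name `collapse_short_edges`.

```python
import math


def collapse_short_edges(positions, edges, threshold):
    """Repeatedly collapse the shortest edge while it is shorter than `threshold`.

    positions: dict mapping a sortable node id to a coordinate tuple.
    edges: iterable of (u, v) pairs; self-loops are ignored.
    threshold: hypothetical cut-off length (an assumption of this sketch).
    Returns (new_positions, new_edges) without modifying the inputs.
    """
    positions = dict(positions)
    edges = {frozenset(e) for e in edges if len(set(e)) == 2}
    while True:
        # Look for the shortest edge that is strictly below the threshold.
        shortest, shortest_len = None, threshold
        for e in edges:
            u, v = tuple(e)
            length = math.dist(positions[u], positions[v])
            if length < shortest_len:
                shortest, shortest_len = e, length
        if shortest is None:
            return positions, edges
        keep, drop = sorted(shortest)
        # Merge the two endpoints into one node placed at their midpoint.
        positions[keep] = tuple(
            (a + b) / 2 for a, b in zip(positions[keep], positions[drop])
        )
        del positions[drop]
        # Redirect edges from the dropped node; discard the collapsed edge.
        rewired = set()
        for e in edges:
            e2 = frozenset(keep if n == drop else n for n in e)
            if len(e2) == 2:
                rewired.add(e2)
        edges = rewired


if __name__ == "__main__":
    pos = {0: (0.0, 0.0), 1: (0.1, 0.0), 2: (1.0, 0.0), 3: (1.0, 1.0)}
    new_pos, new_edges = collapse_short_edges(pos, [(0, 1), (1, 2), (2, 3)], 0.5)
    print(sorted(new_pos), sorted(tuple(sorted(e)) for e in new_edges))
```

In this toy path graph only the edge between nodes 0 and 1 is shorter than the assumed threshold, so a single collapse happens and the remaining edges are rewired to the merged node.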

artificial intelligence, machine learning, spatial reasoning, (19 more...)

arXiv.org Machine Learning

2512.24327

Country: Europe > France (0.28)

Genre: Research Report (0.82)

Industry: Transportation > Infrastructure & Services (0.48)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.66)


SpelsNet: Surface Primitive Elements Segmentation by B-Rep Graph Structure Supervision

Neural Information Processing Systems, Dec-27-2025, 16:18:04 GMT

Shah and Mäntylä [1995] from an unstructured 3D scan (e.g.

dataset, point cloud, spelsnet, (16 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Information Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.46)


Encoding Spatial Distribution of Convolutional Features for Texture Representation

Neural Information Processing Systems, Dec-24-2025, 20:46:23 GMT

However, global average pooling (GAP) cannot adequately characterize complex distributive patterns of spatial features, even though such patterns play an important role in texture-oriented applications, e.g., material recognition and ground terrain classification. In the context of texture representation, this paper addresses the issue by proposing Fractal Encoding (FE), a feature encoding module grounded in multi-fractal geometry. Considering a CNN feature map as a union of level sets of points lying in 2D space, FE characterizes their spatial layout via a local-global hierarchical fractal analysis that examines the multi-scale power behavior on each level set. This enables a CNN to encode the regularity of the spatial arrangement of image features, leading to a robust yet discriminative spectrum descriptor. In addition, FE has trainable parameters for data adaptivity and can be easily incorporated into existing CNNs for end-to-end training. We applied FE to ResNet-based texture classification and retrieval, and demonstrated its effectiveness on several benchmark datasets.

convolutional feature, encoding spatial distribution, name change, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.60)
Information Technology > Artificial Intelligence > Machine Learning (0.37)


Geometric Exploitation for Indoor Panoramic Semantic Segmentation

Neural Information Processing Systems, Dec-24-2025, 18:32:00 GMT

PAnoramic Semantic Segmentation (PASS) is an important task in computer vision, as it enables semantic understanding of a 360° environment. Currently, most existing works focus on addressing the distortion issues in 2D panoramic images without considering the spatial properties of indoor scenes. This restricts PASS methods in perceiving contextual attributes to deal with the ambiguity that arises when working with monocular images. In this paper, we propose a novel approach for indoor panoramic semantic segmentation. Unlike previous works, we consider the panoramic image as a composition of segment groups: over-sampled segments, representing planar structures such as floors and ceilings, and under-sampled segments, representing other scene elements. To optimize each group, we first enhance over-sampled segments by jointly optimizing with a dense depth estimation task. Then, we introduce a transformer-based context module that aggregates different geometric representations of the scene. Combined with a simple high-resolution branch, it serves as a robust hybrid decoder for estimating under-sampled segments, effectively preserving the resolution of predicted masks while leveraging various indoor geometric properties.

artificial intelligence, machine learning, spatial reasoning, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.60)
Information Technology > Artificial Intelligence > Machine Learning (0.40)


Object-Centric Representation Learning with Generative Spatial-Temporal Factorization

Neural Information Processing Systems, Dec-24-2025, 04:02:26 GMT

Learning object-centric scene representations is essential for attaining structural understanding and abstraction of complex scenes. Yet, as current approaches for unsupervised object-centric representation learning are built upon either a stationary observer assumption or a static scene assumption, they often: i) suffer from single-view spatial ambiguities, or ii) infer object representations from dynamic scenes incorrectly or inaccurately. To address this, we propose Dynamics-aware Multi-Object Network (DyMON), a method that broadens the scope of multi-view object-centric representation learning to dynamic scenes. We train DyMON on multi-view-dynamic-scene data and show that DyMON learns, without supervision, to factorize the entangled effects of observer motions and scene object dynamics from a sequence of observations, and constructs scene object spatial representations suitable for rendering at arbitrary times (querying across time) and from arbitrary viewpoints (querying across space). We also show that the factorized scene representations (w.r.t.

generative spatial-temporal factorization, object-centric representation learning, representation, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.40)


Learning Optical Flow from Continuous Spike Streams

Neural Information Processing Systems, Dec-24-2025, 00:28:29 GMT

The spike camera is an emerging bio-inspired vision sensor with ultra-high temporal resolution. It records scenes by accumulating photons and outputting continuous binary spike streams. Optical flow is a key task for spike cameras and their applications. A previous attempt has been made at spike-based optical flow. However, that work focuses only on motion between two moments, and it uses graphics-based data for training, which limits its generalization. In this paper, we propose a tailored network, Spike2Flow, which extracts information from binary spikes with a temporal-spatial representation based on the differential of spike firing time and spatial information aggregation.

continuous spike stream, learning optical flow, name change, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.61)


When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Yan, Yuping, Xie, Yuhan, Zhang, Yixin, Lyu, Lingjuan, Wang, Handing, Jin, Yaochu

arXiv.org Artificial Intelligence, Dec-12-2025

Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.

arxiv preprint arxiv, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.16203

Genre: Research Report (0.50)

Industry:

Information Technology > Security & Privacy (0.86)
Government > Military (0.86)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
(2 more...)


Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation

Xing, Hao, Boey, Kai Zhe, Wu, Yuankai, Burschka, Darius, Cheng, Gordon

arXiv.org Artificial Intelligence, Dec-12-2025

Abstract: Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces three key contributions. First, a sinusoidal encoding strategy that maps 3D skeleton coordinates into a continuous sin-cos space to enhance spatial representation robustness. Second, a temporal graph fusion module that aligns multi-modal inputs with differing resolutions via hierarchical feature aggregation. Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving F1@10: 94.5% and F1@25: 92.8%.

I. INTRODUCTION Human action segmentation, the task of temporally decomposing continuous activities into coherent sub-action units, is a cornerstone of intelligent robotic systems operating in collaborative environments.
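The sinusoidal encoding of 3D skeleton coordinates mentioned in the abstract above can be illustrated with a short Python sketch. The abstract does not give the exact formula, so this assumes a common frequency-band scheme (sin and cos of the coordinate scaled by powers of two times pi); the function name `sinusoidal_encode` and the parameter `num_bands` are hypothetical and are not taken from the paper.

```python
import math


def sinusoidal_encode(joint, num_bands=4):
    """Map one joint's coordinates to a flat list of sin/cos features.

    joint: iterable of floats, e.g. an (x, y, z) skeleton coordinate.
    num_bands: number of frequency bands per coordinate (assumed value).
    Output length is len(joint) * num_bands * 2.
    """
    features = []
    for coord in joint:
        for k in range(num_bands):
            # Assumed frequency schedule: 2^k * pi, doubling with each band.
            angle = (2.0 ** k) * math.pi * coord
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


if __name__ == "__main__":
    encoded = sinusoidal_encode((0.25, -0.1, 0.8))
    print(len(encoded))
```

Every output value lies in [-1, 1] regardless of the scale of the input coordinate, which is one plausible reason such an encoding could make the representation less sensitive to noisy pose estimates; the paper's own motivation and formulation may differ.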

action recognition, artificial intelligence, spatial reasoning, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/IROS60139.2025.11245867

2507.00752

Country: Europe > Germany (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.66)
