Spatial Reasoning


Encoding Spatial Distribution of Convolutional Features for Texture Representation

Neural Information Processing Systems

However, global average pooling (GAP) cannot adequately characterize the complex distributive patterns of spatial features, while such patterns play an important role in texture-oriented applications, e.g., material recognition and ground terrain classification. In the context of texture representation, this paper addresses the issue by proposing Fractal Encoding (FE), a feature encoding module grounded in multi-fractal geometry. Considering a CNN feature map as a union of level sets of points lying in the 2D space, FE characterizes their spatial layout via a local-global hierarchical fractal analysis that examines the multi-scale power behavior on each level set. This enables a CNN to encode the regularity of the spatial arrangement of image features, leading to a robust yet discriminative spectrum descriptor. In addition, FE has trainable parameters for data adaptivity and can easily be incorporated into existing CNNs for end-to-end training. We apply FE to ResNet-based texture classification and retrieval and demonstrate its effectiveness on several benchmark datasets.
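As a rough illustration of the underlying idea, the NumPy sketch below estimates a box-counting dimension for each level set of one feature-map channel; the level-set quantization, the scale choices, and the plain least-squares slope fit are assumptions made for illustration, whereas the actual FE module is differentiable, has trainable parameters, and operates inside the network.

import numpy as np

def level_sets(feat, n_levels=8):
    """Quantize a 2D feature map into n_levels binary level sets."""
    lo, hi = feat.min(), feat.max()
    edges = np.linspace(lo, hi, n_levels + 1)
    return [(feat >= edges[k]) & (feat < edges[k + 1] + 1e-8) for k in range(n_levels)]

def box_counting_dimension(mask, scales=(1, 2, 4, 8)):
    """Estimate the fractal (box-counting) dimension of a binary point set."""
    h, w = mask.shape
    counts = []
    for s in scales:
        # count boxes of side s that contain at least one point of the set
        hs, ws = h // s, w // s
        boxes = mask[:hs * s, :ws * s].reshape(hs, s, ws, s).any(axis=(1, 3))
        counts.append(max(boxes.sum(), 1))
    # slope of log(count) against log(1/scale) approximates the dimension
    slope, _ = np.polyfit(np.log(1.0 / np.array(scales)), np.log(counts), 1)
    return slope

feat = np.random.rand(32, 32)                 # one channel of a CNN feature map
spectrum = [box_counting_dimension(m) for m in level_sets(feat)]
print(spectrum)                               # a crude multi-fractal spectrum descriptor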


Geometric Exploitation for Indoor Panoramic Semantic Segmentation

Neural Information Processing Systems

PAnoramic Semantic Segmentation (PASS) is an important task in computer vision, as it enables semantic understanding of a 360° environment. Currently, most existing works focus on addressing the distortion issues in 2D panoramic images without considering spatial properties of the indoor scene. This restricts PASS methods in perceiving contextual attributes to deal with the ambiguity when working with monocular images. In this paper, we propose a novel approach for indoor panoramic semantic segmentation. Unlike previous works, we consider the panoramic image as a composition of segment groups: over-sampled segments, representing planar structures such as floors and ceilings, and under-sampled segments, representing other scene elements. To optimize each group, we first enhance over-sampled segments by jointly optimizing with a dense depth estimation task. Then, we introduce a transformer-based context module that aggregates different geometric representations of the scene; combined with a simple high-resolution branch, it serves as a robust hybrid decoder for estimating under-sampled segments, effectively preserving the resolution of predicted masks while leveraging various indoor geometric properties.
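As a loose illustration of why floors and ceilings end up over-sampled, the NumPy sketch below splits the rows of an equirectangular panorama by their per-pixel solid-angle weight, which is proportional to the cosine of latitude; the threshold and the row-wise grouping are illustrative assumptions, not the segment-group definition used in the paper.

import numpy as np

def latitude_sampling_weights(height):
    """Per-row solid-angle weight of an equirectangular panorama: rows near the
    poles cover little solid angle per pixel (over-sampled), equatorial rows the most."""
    lat = (np.arange(height) + 0.5) / height * np.pi - np.pi / 2   # -pi/2 .. pi/2
    return np.cos(lat)

def split_rows(height, thresh=0.5):
    """Illustrative split: low-weight rows form the 'over-sampled' group (typically
    floor/ceiling in indoor panoramas), the remaining rows the 'under-sampled' group."""
    w = latitude_sampling_weights(height)
    return np.where(w < thresh)[0], np.where(w >= thresh)[0]

over_rows, under_rows = split_rows(512)
print(len(over_rows), len(under_rows))   # roughly one third vs two thirds of the rows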


Object-Centric Representation Learning with Generative Spatial-Temporal Factorization

Neural Information Processing Systems

Learning object-centric scene representations is essential for attaining structural understanding and abstraction of complex scenes. Yet, as current approaches for unsupervised object-centric representation learning are built upon either a stationary-observer assumption or a static-scene assumption, they often i) suffer from single-view spatial ambiguities, or ii) infer object representations incorrectly or inaccurately from dynamic scenes. To address this, we propose the Dynamics-aware Multi-Object Network (DyMON), a method that broadens the scope of multi-view object-centric representation learning to dynamic scenes. We train DyMON on multi-view dynamic-scene data and show that DyMON learns, without supervision, to factorize the entangled effects of observer motions and scene object dynamics from a sequence of observations, and constructs scene object spatial representations suitable for rendering at arbitrary times (querying across time) and from arbitrary viewpoints (querying across space). We also show that the factorized scene representations (w.r.t.


Learning Optical Flow from Continuous Spike Streams

Neural Information Processing Systems

Spike camera is an emerging bio-inspired vision sensor with ultra-high temporal resolution. It records scenes by accumulating photons and outputting continuous binary spike streams. Optical flow is a key task for spike cameras and their applications. A previous attempt has been made at spike-based optical flow; however, that work focuses only on motion between two moments and uses graphics-based data for training, which limits generalization. In this paper, we propose a tailored network, Spike2Flow, that extracts information from binary spikes with a temporal-spatial representation based on the differential of spike firing time and spatial information aggregation.
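A minimal sketch of the firing-time idea follows: for a binary spike stream, the per-pixel inter-spike interval is tracked and its temporal difference is taken as a crude motion-sensitive cue. The function names and the looped implementation are assumptions for illustration and are not the Spike2Flow representation itself.

import torch

def firing_intervals(spikes):
    """spikes: (T, H, W) binary stream. Return per-pixel inter-spike intervals at
    each firing time (0 where no previous spike exists)."""
    T, H, W = spikes.shape
    last_fire = torch.full((H, W), -1, dtype=torch.long)
    intervals = torch.zeros_like(spikes, dtype=torch.float32)
    for t in range(T):
        fired = spikes[t].bool()
        has_prev = fired & (last_fire >= 0)
        intervals[t][has_prev] = (t - last_fire[has_prev]).float()
        last_fire[fired] = t
    return intervals

spikes = (torch.rand(64, 32, 32) > 0.7).float()   # toy binary spike stream
intervals = firing_intervals(spikes)
diff = intervals[1:] - intervals[:-1]             # temporal differential of firing intervals
print(diff.shape)                                 # torch.Size([63, 32, 32])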


When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Yan, Yuping, Xie, Yuhan, Zhang, Yixin, Lyu, Lingjuan, Wang, Handing, Jin, Yaochu

arXiv.org Artificial Intelligence

Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
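For intuition only, the snippet below shows a generic FGSM-style visual perturbation against a model assumed to map an (image, instruction) pair to an action vector; the model interface, loss, budget, and the DummyVLA stand-in are hypothetical, and this is not the VLA-Fool attack itself, which additionally covers textual and cross-modal misalignment perturbations.

import torch

def fgsm_visual_attack(model, image, instruction, target_action, eps=2 / 255):
    """Nudge pixels within an L_inf ball of radius eps to push the predicted
    action away from a reference action (loss ascent on one gradient step)."""
    image = image.clone().detach().requires_grad_(True)
    pred = model(image, instruction)
    loss = torch.nn.functional.mse_loss(pred, target_action)
    loss.backward()
    adv = image + eps * image.grad.sign()
    return adv.clamp(0, 1).detach()

class DummyVLA(torch.nn.Module):
    """Stand-in policy head; a real VLA would also consume the instruction."""
    def __init__(self):
        super().__init__()
        self.head = torch.nn.Linear(3 * 64 * 64, 7)
    def forward(self, image, instruction):
        return self.head(image.flatten(1))

img = torch.rand(1, 3, 64, 64)
adv = fgsm_visual_attack(DummyVLA(), img, "pick up the cup", torch.zeros(1, 7))
print((adv - img).abs().max())   # perturbation magnitude stays within eps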


Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation

Xing, Hao, Boey, Kai Zhe, Wu, Yuankai, Burschka, Darius, Cheng, Gordon

arXiv.org Artificial Intelligence

Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces three key contributions. First, a sinusoidal encoding strategy that maps 3D skeleton coordinates into a continuous sin-cos space to enhance spatial representation robustness. Second, a temporal graph fusion module that aligns multi-modal inputs with differing resolutions via hierarchical feature aggregation. Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving F1@10: 94.5% and F1@25: 92.8%. I. INTRODUCTION Human action segmentation, the task of temporally decomposing continuous activities into coherent sub-action units, is a cornerstone of intelligent robotic systems operating in collaborative environments.
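As a hedged sketch of the first contribution, the PyTorch snippet below maps raw 3D joint coordinates into a multi-frequency sin-cos space; the number of frequencies and the frequency schedule are assumptions, not the paper's exact configuration.

import torch

def sinusoidal_encode(joints, num_freqs=4):
    """joints: (..., 3) raw 3D skeleton coordinates. Encode each coordinate at
    several frequencies with sin and cos, a standard positional-encoding scheme."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=joints.dtype, device=joints.device)
    scaled = joints.unsqueeze(-1) * freqs                    # (..., 3, num_freqs)
    enc = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)
    return enc.flatten(-2)                                   # (..., 3 * 2 * num_freqs)

skel = torch.randn(30, 25, 3)                 # 30 frames, 25 joints, xyz coordinates
print(sinusoidal_encode(skel).shape)          # torch.Size([30, 25, 24])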


AugLift: Uncertainty Aware Depth Descriptors for Robust 2D to 3D Pose Lifting

Warner, Nikolai, Zhang, Wenjin, Badiozamani, Hamid, Essa, Irfan, Sadhwani, Apaar

arXiv.org Artificial Intelligence

Lifting-based 3D human pose estimators infer 3D joints from 2D keypoints, but often struggle to generalize to real-world settings with noisy 2D detections. We revisit the input to lifting and propose AugLift, a simple augmentation of standard lifting that enriches each 2D keypoint (x, y) with an Uncertainty-Aware Depth Descriptor (UADD). We run a single off-the-shelf monocular depth estimator to obtain a depth map, and for every keypoint with detector confidence c we extract depth statistics from its confidence-scaled neighborhood, forming a compact, interpretable UADD (c, d, d_min, d_max) that captures both local geometry and reliability. AugLift is modular, requires no new sensors or architectural changes, and integrates by expanding the input layer of existing lifting models. Across four datasets and four lifting architectures, AugLift boosts cross-dataset (out-of-distribution) performance on unseen data by an average of 10.1 percent, while also improving in-distribution performance by 4.0 percent as measured by MPJPE. A post hoc analysis clarifies when and why it helps: gains are largest on novel poses and significantly occluded joints, where depth statistics resolve front-back ambiguities while confidence calibrates the spatial neighborhoods from which they are drawn. We also study interaction with recent image-feature lifting methods and find the signals are complementary: adding UADD to image-conditioned lifting yields both ID and OOD gains. A learned depth-feature extension (AugLiftV2) improves performance further while trading off interpretability. Together, these results indicate that lightweight, confidence-aware depth cues are a powerful plug-in for robust 2D-to-3D pose lifting.
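A minimal sketch of the UADD construction is given below, assuming the confidence-scaled neighborhood is a square patch whose radius grows as detector confidence drops; the radius rule and the exact layout of the statistics are illustrative assumptions.

import numpy as np

def uadd(depth, kp_xy, conf, base_radius=8):
    """Build a (c, d, d_min, d_max) descriptor for one 2D keypoint.
    depth: (H, W) monocular depth map; kp_xy: (x, y) pixel coords; conf: detector
    confidence in [0, 1]."""
    h, w = depth.shape
    x = int(np.clip(round(kp_xy[0]), 0, w - 1))
    y = int(np.clip(round(kp_xy[1]), 0, h - 1))
    r = max(1, int(round(base_radius * (1.0 - conf) + 1)))   # less confident -> wider patch
    patch = depth[max(0, y - r):min(h, y + r + 1), max(0, x - r):min(w, x + r + 1)]
    return np.array([conf, patch.mean(), patch.min(), patch.max()], dtype=np.float32)

depth = np.random.rand(256, 256).astype(np.float32)
print(uadd(depth, kp_xy=(120.3, 64.7), conf=0.9))   # descriptor appended to (x, y)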


GLaD: Geometric Latent Distillation for Vision-Language-Action Models

Guo, Minghao, Cao, Meng, Tao, Jiachen, Xu, Rongtao, Yan, Yan, Liang, Xiaodan, Laptev, Ivan, Chang, Xiaojun

arXiv.org Artificial Intelligence

Most existing Vision-Language-Action (VLA) models rely primarily on RGB information, while ignoring geometric cues crucial for spatial reasoning and manipulation. In this work, we introduce GLaD, a geometry-aware VLA framework that incorporates 3D geometric priors during pretraining through knowledge distillation. Rather than distilling geometric features solely into the vision encoder, we align the LLM's hidden states corresponding to visual tokens with features from a frozen geometry-aware vision transformer (VGGT), ensuring that geometric understanding is deeply integrated into the multimodal representations that drive action prediction. Pretrained on the Bridge dataset with this geometry distillation mechanism, GLaD achieves a 94.1% average success rate across four LIBERO task suites, outperforming UniVLA (92.5%), which uses identical pretraining data. These results validate that geometry-aware pretraining enhances spatial reasoning and policy generalization without requiring explicit depth sensors or 3D annotations. VISION-LANGUAGE-ACTION (VLA) models have emerged as a promising paradigm for embodied intelligence, enabling robots to generate control actions directly from visual observations and natural language instructions. Recent works [1]-[4] have demonstrated impressive performance on diverse manipulation tasks by leveraging large-scale multimodal pretraining. These models typically combine powerful vision encoders [5]-[7] and large language models to learn generalizable visuomotor policies from extensive robot demonstration datasets. Despite these advances, current VLA architectures fundamentally lack geometric understanding, which represents the capability of perceiving spatial positions, 3D structures, and relational arrangements among objects in a scene--knowledge that is essential for robots to reason about where objects are, how they relate to each other, and how to interact with them effectively. Most VLAs rely on vision encoders pretrained with 2D contrastive objectives such as CLIP [5] or SigLIP [7], which excel at capturing semantic correspondences between images and text but do not encode 3D spatial information.
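The distillation idea can be sketched as an alignment loss between projected LLM hidden states at visual-token positions and frozen teacher features. The cosine objective and the linear projection below are assumptions for illustration; the paper's exact alignment loss is not specified here.

import torch
import torch.nn.functional as F

def geometry_distillation_loss(llm_hidden_visual, geom_feats, proj):
    """Align student hidden states (B, N, D_s) with frozen geometry-teacher
    features (B, N, D_t) via a learned projection and a cosine objective."""
    student = F.normalize(proj(llm_hidden_visual), dim=-1)
    teacher = F.normalize(geom_feats.detach(), dim=-1)   # teacher stays frozen
    return 1.0 - (student * teacher).sum(-1).mean()

B, N, d_s, d_t = 2, 196, 1024, 768
proj = torch.nn.Linear(d_s, d_t)
loss = geometry_distillation_loss(torch.randn(B, N, d_s), torch.randn(B, N, d_t), proj)
print(loss.item())   # added to the usual action-prediction objective during pretraining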


Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models

Ye, Yifan, Ma, Jiaqi, Cen, Jun, Lu, Zhihe

arXiv.org Artificial Intelligence

Vision-Language-Action (VLA) models pretrained on large-scale multimodal datasets have emerged as powerful foundations for robotic perception and control. However, their massive scale, often billions of parameters, poses significant challenges for real-time deployment, as inference becomes computationally expensive and latency-sensitive in dynamic environments. To address this, we propose Token Expand-and-Merge-VLA (TEAM-VLA), a training-free token compression framework that accelerates VLA inference while preserving task performance. TEAM-VLA introduces a dynamic token expansion mechanism that identifies and samples additional informative tokens in the spatial vicinity of attention-highlighted regions, enhancing contextual completeness. These expanded tokens are then selectively merged in deeper layers under action-aware guidance, effectively reducing redundancy while maintaining semantic coherence. By coupling expansion and merging within a single feed-forward pass, TEAM-VLA achieves a balanced trade-off between efficiency and effectiveness, without any retraining or parameter updates. Extensive experiments on the LIBERO benchmark demonstrate that TEAM-VLA consistently improves inference speed while maintaining or even surpassing the task success rate of full VLA models. The code is publicly available at https://github.com/Jasper-aaa/TEAM-VLA.
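A rough, training-free sketch of the expand-then-merge idea is shown below: keep the most-attended patch tokens plus their spatial neighbours (expansion), then fold each remaining token into its most similar kept token (merging). The neighbourhood rule, ratios, and averaging scheme are assumptions, not TEAM-VLA's exact procedure.

import torch
import torch.nn.functional as F

def expand_then_merge(tokens, attn, grid_hw, top_k=32):
    """tokens: (N, D) patch tokens laid out on an (H, W) grid; attn: (N,) attention
    each token receives. Returns a compressed set of tokens."""
    H, W = grid_hw
    N, _ = tokens.shape
    keep = set(attn.topk(top_k).indices.tolist())
    for i in list(keep):                                   # expand around salient tokens
        y, x = divmod(i, W)
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < H and 0 <= nx < W:
                keep.add(ny * W + nx)
    rest_ids = [i for i in range(N) if i not in keep]
    kept = tokens[sorted(keep)].clone()
    if not rest_ids:
        return kept
    rest = tokens[rest_ids]
    sim = F.normalize(rest, dim=-1) @ F.normalize(kept, dim=-1).T
    nearest = sim.argmax(dim=-1).tolist()
    for r, k in enumerate(nearest):                        # fold redundant tokens into matches
        kept[k] = 0.5 * (kept[k] + rest[r])
    return kept

out = expand_then_merge(torch.randn(196, 768), torch.rand(196), grid_hw=(14, 14))
print(out.shape)                                           # far fewer than 196 tokens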


Temporal-Spatial Tubelet Embedding for Cloud-Robust MSI Reconstruction using MSI-SAR Fusion: A Multi-Head Self-Attention Video Vision Transformer Approach

Wang, Yiqun, Li, Lujun, Yue, Meiru, State, Radu

arXiv.org Artificial Intelligence

Cloud cover in multispectral imagery (MSI) significantly hinders early-season crop mapping by corrupting spectral information. Existing Vision Transformer (ViT)-based time-series reconstruction methods, like SMTS-ViT, often employ coarse temporal embeddings that aggregate entire sequences, causing substantial information loss and reducing reconstruction accuracy. To address these limitations, a Video Vision Transformer (ViViT)-based framework with temporal-spatial fusion embedding for MSI reconstruction in cloud-covered regions is proposed in this study. Non-overlapping tubelets are extracted via 3D convolution with a constrained temporal span (t = 2), ensuring local temporal coherence while reducing cross-day information degradation. Both MSI-only and SAR-MSI fusion scenarios are considered in the experiments. Comprehensive experiments on 2020 Traill County data demonstrate notable performance improvements: MTS-ViViT achieves a 2.23% reduction in MSE compared to the MTS-ViT baseline, while SMTS-ViViT achieves a 10.33% improvement with SAR integration over the SMTS-ViT baseline. The proposed framework effectively enhances spectral reconstruction quality for robust agricultural monitoring.
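A minimal PyTorch sketch of the non-overlapping tubelet embedding with temporal span t = 2 follows; the band count, patch size, and embedding width are assumptions made for illustration, not the configuration reported in the paper.

import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Non-overlapping tubelet embedding via 3D convolution with temporal span t."""
    def __init__(self, in_ch=10, embed_dim=256, t=2, patch=4):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, embed_dim,
                              kernel_size=(t, patch, patch),
                              stride=(t, patch, patch))      # stride == kernel: no overlap

    def forward(self, x):                    # x: (B, C, T, H, W) multispectral time series
        x = self.proj(x)                     # (B, D, T//t, H//patch, W//patch)
        return x.flatten(2).transpose(1, 2)  # (B, num_tubelets, D) tokens for the ViViT

tokens = TubeletEmbedding()(torch.randn(1, 10, 8, 64, 64))
print(tokens.shape)   # torch.Size([1, 1024, 256])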