AITopics

2510.27606

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.71)

arXiv.org Artificial IntelligenceNov-26-2025

FSR-VLN: Fast and Slow Reasoning for Vision-Language Navigation with Hierarchical Multi-modal Scene Graph

Zhou, Xiaolin, Xiao, Tingyang, Liu, Liu, Wang, Yucheng, Chen, Maiyue, Meng, Xinrui, Wang, Xinjie, Feng, Wei, Sui, Wei, Su, Zhizhong

Visual-Language Navigation (VLN) is a fundamental challenge in robotic systems, with broad applications for the deployment of embodied agents in real-world environments. Despite recent advances, existing approaches are limited in long-range spatial reasoning, often exhibiting low success rates and high inference latency, particularly in long-range navigation tasks. To address these limitations, we propose FSR-VLN, a vision-language navigation system that combines a Hierarchical Multi-modal Scene Graph (HMSG) with Fast-to-Slow Navigation Reasoning (FSR). The HMSG provides a multi-modal map representation supporting progressive retrieval, from coarse room-level localization to fine-grained goal view and object identification. Building on HMSG, FSR first performs fast matching to efficiently select candidate rooms, views, and objects, then applies VLM-driven refinement for final goal selection. We evaluated FSR-VLN across four comprehensive indoor datasets collected by humanoid robots, utilizing 87 instructions that encompass a diverse range of object categories. FSR-VLN achieves state-of-the-art (SOTA) performance in all datasets, measured by the retrieval success rate (RSR), while reducing the response time by 82% compared to VLM-based methods on tour videos by activating slow reasoning only when fast intuition fails. Furthermore, we integrate FSR-VLN with speech interaction, planning, and control modules on a Unitree-G1 humanoid robot, enabling natural language interaction and real-time navigation.

large language model, natural language, navigation, (18 more...)

2509.13733

Country: Asia > China (0.14)

Genre: Research Report (0.42)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.66)
Information Technology > Artificial Intelligence > Robots > Humanoid Robots (0.54)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.52)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.48)

Mixture of Horizons in Action Chunking

Jing, Dong, Wang, Gang, Liu, Jiaqi, Tang, Weiliang, Sun, Zelong, Yao, Yunchao, Wei, Zhenyu, Liu, Yunhui, Lu, Zhiwu, Ding, Mingyu

Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying fixed choice of single horizons being suboptimal. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based policies $π_0$, $π_{0.5}$, and one-step regression policy $π_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulations and real-world tasks. Notably, under mixed-task setting, $π_{0.5}$ with MoH reaches a new state-of-the-art with 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons

artificial intelligence, arxiv preprint arxiv, machine learning, (17 more...)

2511.19433

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.46)

Liu, Zesheng, Rahnemoonfar, Maryam

GRIT-LP: Graph Transformer with Long-Range Skip Connection and Partitioned Spatial Graphs for Accurate Ice Layer Thickness Prediction

Graph transformers have demonstrated remarkable capability on complex spatio-temporal tasks, yet their depth is often limited by oversmoothing and weak long-range dependency modeling. To address these challenges, we introduce GRIT -LP, a graph transformer explicitly designed for polar ice-layer thickness estimation from polar radar imagery. Accurately estimating ice layer thickness is critical for understanding snow accumulation, reconstructing past climate patterns and reducing uncertainties in projections of future ice sheet evolution and sea level rise. GRIT -LP combines an inductive geometric graph learning framework with self-attention mechanism, and introduces two major innovations that jointly address challenges in modeling the spatio-temporal patterns of ice layers: a partitioned spatial graph construction strategy that forms overlapping, fully connected local neighborhoods to preserve spatial coherence and suppress noise from irrelevant long-range links, and a long-range skip connection mechanism within the transformer that improves information flow and mitigates oversmooth-ing in deeper attention layers. We conducted extensive experiments, demonstrating that GRIT -LP outperforms current state-of-the-art methods with a 24.92% improvement in root mean squared error. These results highlight the effectiveness of graph transformers in modeling spatiotemporal patterns by capturing both localized structural features and long-range dependencies across internal ice layers, and demonstrate their potential to advance data-driven understanding of cryospheric processes. Introduction Graph transformers have proven to be highly effective for modeling complex graph-structured data, with wide-range of applications in real-world scenarios, particularly those involving spatiotemporal patterns. Their ability to capture intricate relationships and dependencies makes them highly valuable in domains such as pedestrian trajectory prediction [1] and traffic prediction [2]. Despite their success, current graph transformer architectures face notable limitations, including overfitting and over-smoothing--a phenomenon where node features become indistinguishable as layers deepen [3]. Additionally, many existing graph transformers are relatively shallow, limiting their ability to effectively capture the complex, long-range dependencies that often emerge in real-world datasets.

artificial intelligence, machine learning, spatial reasoning, (20 more...)

2511.18716

Country: North America > United States (0.68)

Genre: Research Report > Promising Solution (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.94)

GVD-TG: Topological Graph based on Fast Hierarchical GVD Sampling for Robot Exploration

Li, Yanbin, Xiao, Canran, Yuan, Shenghai, Yu, Peilai, Li, Ziruo, Zhang, Zhiguo, Chi, Wenzheng, Zhang, Wei

Topological maps are more suitable than metric maps for robotic exploration tasks. However, real-time updating of accurate and detail-rich environmental topological maps remains a challenge. This paper presents a topological map updating method based on the Generalized Voronoi Diagram (GVD). First, the newly observed areas are denoised to avoid low-efficiency GVD nodes misleading the topological structure. Subsequently, a multi-granularity hierarchical GVD generation method is designed to control the sampling granularity at both global and local levels. This not only ensures the accuracy of the topological structure but also enhances the ability to capture detail features, reduces the probability of path backtracking, and ensures no overlap between GVDs through the maintenance of a coverage map, thereby improving GVD utilization efficiency. Second, a node clustering method with connectivity constraints and a connectivity method based on a switching mechanism are designed to avoid the generation of unreachable nodes and erroneous nodes caused by obstacle attraction. A special cache structure is used to store all connectivity information, thereby improving exploration efficiency. Finally, to address the issue of frontiers misjudgment caused by obstacles within the scope of GVD units, a frontiers extraction method based on morphological dilation is designed to effectively ensure the reachability of frontiers. On this basis, a lightweight cost function is used to assess and switch to the next viewpoint in real time. This allows the robot to quickly adjust its strategy when signs of path backtracking appear, thereby escaping the predicament and increasing exploration flexibility. And the performance of system for exploration task is verified through comparative tests with SOTA methods.

artificial intelligence, machine learning, node, (17 more...)

2511.18708

Country: Asia > China (0.46)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints

Xu, Rui, Lu, Dakuan, Zhao, Zicheng, Tan, Xiaoyu, Wang, Xintao, Yuan, Siyu, Chen, Jiangjie, Xu, Yinghui

Spatial reasoning is a key capability in the field of artificial intelligence, especially crucial in areas such as robotics, computer vision, and natural language understanding. However, evaluating the ability of multimodal large language models(MLLMs) in complex spatial reasoning still faces challenges, particularly in scenarios requiring multi-step reasoning and precise mathematical constraints. This paper introduces ORIGAMISPACE, a new dataset and benchmark designed to evaluate the multi-step spatial reasoning ability and the capacity to handle mathematical constraints of MLLMs through origami tasks. The dataset contains 350 data instances,each comprising a strictly formatted crease pattern (CP diagram), the Compiled Flat Pattern, the complete Folding Process, and the final Folded Shape Image. We propose four evaluation tasks: Pattern Prediction, Multi-step Spatial Reasoning, Spatial Relationship Prediction, and End-to-End CP Code Generation. For the CP code generation task, we design an interactive environment and explore the possibility of using reinforcement learning methods to train MLLMs. Through experiments on existing MLLMs, we initially reveal the strengths and weaknesses of these models in handling complex spatial reasoning tasks.

large language model, machine learning, natural language, (21 more...)

2511.1845

Genre: Research Report > Experimental Study (1.00)

Industry:

Education (0.68)
Transportation > Passenger (0.45)
Transportation > Ground > Road (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

EchoVLA: Robotic Vision-Language-Action Model with Synergistic Declarative Memory for Mobile Manipulation

Lin, Min, Liang, Xiwen, Lin, Bingqian, Jingzhi, Liu, Jiao, Zijian, Li, Kehan, Ma, Yuhan, Liu, Yuecheng, Zhao, Shen, Zhuang, Yuzheng, Liang, Xiaodan

Recent progress in Vision-Language-Action (VLA) models has enabled embodied agents to interpret multimodal instructions and perform complex tasks. However, existing VLAs are mostly confined to short-horizon, table-top manipulation, lacking the memory and reasoning capability required for long-horizon mobile manipulation, where agents must coordinate navigation and manipulation under changing spatial contexts. In this work, we present EchoVLA, a memory-aware VLA model for long-horizon mobile manipulation. EchoVLA incorporates a synergistic declarative memory inspired by the human brain, consisting of a scene memory that maintains a collection of spatial-semantic maps and an episodic memory that stores task-level experiences with multimodal contextual features. During both training and inference, the two memories are individually stored, updated, and retrieved based on current observations, task history, and instructions, and their retrieved representations are fused via coarse- and fine-grained attention to guide mobile-arm diffusion policies. To support large-scale training and evaluation, we further introduce MoMani, an automated benchmark that generates expert-level long-horizon trajectories through multimodal large language model (MLLM)-guided planning and feedback-driven refinement, supplemented with real-robot demonstrations. Experiments in simulated and real-world settings show that EchoVLA improves long-horizon performance, reaching 0.52 SR on manipulation/navigation and 0.31 on mobile manipulation, exceeding $π_{0.5}$ by +0.08 and +0.11.

artificial intelligence, large language model, natural language, (17 more...)

2511.18112

Genre: Research Report (0.50)

Industry: Health & Medicine (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.48)

ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models

Ye, Wencheng, Wang, Tianshi, Zhu, Lei, Li, Fengling, Yang, Guoli

Recent Vision-Language-Action (VLA) models have shown impressive flexibility and generalization, yet their deployment in robotic manipulation remains limited by heavy computational overhead and inference latency. In this work, we present ActDistill, a general action-guided self-derived distillation framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. Unlike previous efficiency strategies that primarily emphasize vision-language correlations, ActDistill leverages action priors to guide knowledge transfer and model compression, achieving action-oriented efficiency for VLA models. Specifically, we employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction. The student model, derived from the graph-encapsulated teacher, is further equipped with a dynamic router that adaptively selects computation paths based on action prediction demands, guided by hierarchical graph-informed supervision to ensure smooth and efficient evolution. During inference, graph-related auxiliary components are removed, allowing the student to execute only dynamically routed layers and predict high-precision actions with minimal computation and latency. Experiments on embodied benchmarks demonstrate that ActDistill achieves comparable or superior performance to full-scale VLA models while reducing computation by over 50% with up to 1.67 speedup, thereby establishing a general paradigm toward efficient embodied intelligence. Source codes can be found at https://github.com/

machine learning, natural language, object-oriented architecture, (17 more...)

2511.18082

Genre: Research Report (0.82)

Industry: Education (0.54)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.67)
(2 more...)

Holvoet, Freek, Blier-Wong, Christopher, Antonio, Katrien

A multi-view contrastive learning framework for spatial embeddings in risk modelling

Incorporating spatial information, particularly those influenced by climate, weather, and demographic factors, is crucial for improving underwriting precision and enhancing risk management in insurance. However, spatial data are often unstructured, high-dimensional, and difficult to integrate into predictive models. Embedding methods are needed to convert spatial data into meaningful representations for modelling tasks. We propose a novel multi-view contrastive learning framework for generating spatial embeddings that combine information from multiple spatial data sources. To train the model, we construct a spatial dataset that merges satellite imagery and OpenStreetMap features across Europe. The framework aligns these spatial views with coordinate-based encodings, producing low-dimensional embeddings that capture both spatial structure and contextual similarity. Once trained, the model generates embeddings directly from latitude-longitude pairs, enabling any dataset with coordinates to be enriched with meaningful spatial features without requiring access to the original spatial inputs. In a case study on French real estate prices, we compare models trained on raw coordinates against those using our spatial embeddings as inputs. The embeddings consistently improve predictive accuracy across generalised linear, additive, and boosting models, while providing interpretable spatial effects and demonstrating transferability to unseen regions.

artificial intelligence, machine learning, spatial reasoning, (18 more...)

2511.17954

Country:

North America > United States > New York (0.28)
Europe > France > Grand Est (0.28)

Genre: Research Report (1.00)

Industry:

Health & Medicine (1.00)
Banking & Finance > Real Estate (0.90)
Banking & Finance > Insurance (0.87)
Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks

Yang, Cheng, Wan, Haiyuan, Peng, Yiran, Cheng, Xin, Yu, Zhaoyang, Zhang, Jiayi, Yu, Junchi, Yu, Xinlei, Zheng, Xiawu, Zhou, Dongzhan, Wu, Chenglin

Video Models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR-Bench -- a comprehensive benchmark designed to systematically evaluate video models' reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video model. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10--20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.

large language model, machine learning, natural language, (19 more...)

2511.15065

Country: Asia > China (0.67)

Genre: Research Report > New Finding (0.67)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
(2 more...)