AITopics

2508.01415

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(4 more...)

arXiv.org Artificial IntelligenceOct-22-2025

Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views

Chen, Zhangquan, Zhang, Manyuan, Yu, Xinlei, Luo, Xufang, Sun, Mingze, Pan, Zihao, Feng, Yan, Pei, Peng, Cai, Xunliang, Huang, Ruqi

Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. T o address this limitation, we propose 3DThinker, a framework that can effectively exploits the rich geometric information embedded within images while reasoning, like humans do. Our framework is the first to enable 3D men-taling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multi-modal reasoning.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

2510.18632

Country: Asia (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
(2 more...)

arXiv.org Artificial IntelligenceOct-22-2025

HouseTour: A Virtual Real Estate A(I)gent

Çelen, Ata, Pollefeys, Marc, Barath, Daniel, Armeni, Iro

W e introduce HouseT our, a method for spatially-aware 3D camera trajectory and natural language summary generation from a collection of images depicting an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, our approach generates smooth video trajectories via a diffusion process constrained by known camera poses and integrates this information into the VLM for 3D-grounded descriptions. W e synthesize the final video using 3D Gaussian splatting to render novel views along the trajectory. T o support this task, we present the HouseT our dataset, which includes over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments demonstrate that incorporating 3D camera trajectories into the text generation process improves performance over methods handling each task independently. W e evaluate both individual and end-to-end performance, introducing a new joint metric. Our work enables automated, professional-quality video creation for real estate and touristic applications without requiring specialized expertise or equipment.

large language model, machine learning, trajectory, (16 more...)

2510.18054

Genre: Research Report > New Finding (0.46)

Industry: Banking & Finance > Real Estate (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial IntelligenceOct-22-2025

NEBULA: Do We Evaluate Vision-Language-Action Agents Correctly?

Peng, Jierui, Zhang, Yanyan, Duan, Yicheng, Liang, Tuo, Chaudhary, Vipin, Yin, Yu

The evaluation of Vision-Language-Action (VLA) agents is hindered by the coarse, end-task success metric that fails to provide precise skill diagnosis or measure robustness to real-world perturbations. This challenge is exacerbated by a fragmented data landscape that impedes reproducible research and the development of generalist models. To address these limitations, we introduce NEBULA, a unified ecosystem for single-arm manipulation that enables diagnostic and reproducible evaluation. NEBULA features a novel dual-axis evaluation protocol that combines fine-grained capability tests for precise skill diagnosis with systematic stress tests that measure robustness. A standardized API and a large-scale, aggregated dataset are provided to reduce fragmentation and support cross-dataset training and fair comparison. Using NEBULA, we demonstrate that top-performing VLAs struggle with key capabilities such as spatial reasoning and dynamic adaptation, which are consistently obscured by conventional end-task success metrics. By measuring both what an agent can do and when it does so reliably, NEBULA provides a practical foundation for robust, general-purpose embodied agents.

artificial intelligence, machine learning, spatial reasoning, (15 more...)

2510.16263

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.49)

Yao, Ying, Graham, Daniel J.

Interval Prediction of Annual Average Daily Traffic on Local Roads via Quantile Random Forest with High-Dimensional Spatial Data

arXiv.org Machine LearningOct-22-2025

Accurate annual average daily traffic (AADT) data are vital for transport planning and infrastructure management. However, automatic traffic detectors across national road networks often provide incomplete coverage, leading to underrepresentation of minor roads. While recent machine learning advances have improved AADT estimation at unmeasured locations, most models produce only point predictions and overlook estimation uncertainty. This study addresses that gap by introducing an interval prediction approach that explicitly quantifies predictive uncertainty. We integrate a Quantile Random Forest model with Principal Component Analysis to generate AADT prediction intervals, providing plausible traffic ranges bounded by estimated minima and maxima. Using data from over 2,000 minor roads in England and Wales, and evaluated with specialized interval metrics, the proposed method achieves an interval coverage probability of 88.22%, a normalized average width of 0.23, and a Winkler Score of 7,468.47. By combining machine learning with spatial and high-dimensional analysis, this framework enhances both the accuracy and interpretability of AADT estimation, supporting more robust and informed transport planning.

artificial intelligence, machine learning, prediction, (20 more...)

arXiv.org Machine Learning

2510.18548

Country:

North America > United States (1.00)
Asia (1.00)
Europe > United Kingdom > England (0.48)

Genre:

Research Report > New Finding (0.93)
Overview (0.93)

Industry:

Transportation > Infrastructure & Services (1.00)
Transportation > Ground > Road (1.00)
Energy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.86)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Ortega-Peimbert, Jesús, Busch, Finn Lukas, Homberger, Timon, Yang, Quantao, Andersson, Olov

DIV-Nav: Open-Vocabulary Spatial Relationships for Multi-Object Navigation

Abstract-- Advances in open-vocabulary semantic mapping and object navigation have enabled robots to perform an informed search of their environment for an arbitrary object. However, such zero-shot object navigation is typically designed for simple queries with an object name like "television" or "blue rug". Here, we consider more complex free-text queries with spatial relationships, such as "find the remote on the table" while still leveraging robustness of a semantic map. We present DIV-Nav, a real-time navigation system that efficiently addresses this problem through a series of relaxations: i) Decomposing natural language instructions with complex spatial constraints into simpler object-level queries on a semantic map, ii) computing the Intersection of individual semantic belief maps to identify regions where all objects co-exist, and iii) V alidating the discovered objects against the original, complex spatial constrains via a L VLM. We further investigate how to adapt the frontier exploration objectives of online semantic mapping to such spatial search queries to more effectively guide the search process. Robots operating in human environments must interpret natural language commands that go beyond simple object identification. While a command like "find a chair" requires handling simple object classes only, real-world search instructions often specify spatial relationships: "go to the chair next to the desk," "find the towel in the bathroom," or "get the book on the nightstand."

large language model, machine learning, natural language, (22 more...)

2510.16518

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
(4 more...)

From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors

Zhang, Zhengshen, Li, Hao, Dai, Yalun, Zhu, Zhengbang, Zhou, Lei, Liu, Chenchen, Wang, Dong, Tay, Francis E. H., Chen, Sijin, Liu, Ziwei, Liu, Yuxiao, Li, Xinghang, Zhou, Pan

Existing vision-language-action (VLA) models act in 3D real-world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth, or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, our proposed FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height.

arxiv preprint arxiv, machine learning, natural language, (18 more...)

2510.17439

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Pursuing Minimal Sufficiency in Spatial Reasoning

Guo, Yejie, Hou, Yunzhong, Ma, Wufei, Tang, Meng, Yang, Ming-Hsuan

Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from \textit{expert models}. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel SOG (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A Reasoning Agent then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code is available at https://github.com/gyj155/mssr.

information, large language model, machine learning, (17 more...)

2510.16688

Country:

North America > United States (0.28)
Asia (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba

Peng, Kunyu, Wen, Di, Fu, Jia, Wu, Jiamin, Yang, Kailun, Zheng, Junwei, Liu, Ruiping, Chen, Yufan, Fu, Yuqian, Paudel, Danda Pani, Van Gool, Luc, Stiefelhagen, Rainer

Referring Atomic Video Action Recognition (RAVAR) aims to recognize fine-grained, atomic-level actions of a specific person of interest conditioned on natural language descriptions. Distinct from conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is particularly critical for interactive human action analysis in complex multi-person scenarios. In this work, we extend our previously introduced RefAVA dataset to RefAVA++, which comprises >2.9 million frames and >75.1k annotated persons in total. We benchmark this dataset using baselines from multiple related domains, including atomic action localization, video question answering, and text-video retrieval, as well as our earlier model, RefAtomNet. Although RefAtomNet surpasses other baselines by incorporating agent attention to highlight salient features, its ability to align and retrieve cross-modal information remains limited, leading to suboptimal performance in localizing the target person and predicting fine-grained actions. To overcome the aforementioned limitations, we introduce RefAtomNet++, a novel framework that advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism combined with multi-trajectory Mamba modeling at the partial-keyword, scene-attribute, and holistic-sentence levels. In particular, scanning trajectories are constructed by dynamically selecting the nearest visual spatial tokens at each timestep for both partial-keyword and scene-attribute levels. Moreover, we design a multi-hierarchical semantic-aligned cross-attention strategy, enabling more effective aggregation of spatial and temporal tokens across different semantic hierarchies. Experiments show that RefAtomNet++ establishes new state-of-the-art results. The dataset and code are released at https://github.com/KPeng9510/refAVA2.

large language model, machine learning, natural language, (19 more...)

2510.16444

Country:

Europe (0.46)
Asia > China (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial IntelligenceOct-20-2025

Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models

Wang, Xingrui, Ma, Wufei, Zhang, Tiezheng, de Melo, Celso M, Chen, Jieneng, Yuille, Alan

Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. T o address this limitation, we present Spatial457, a scalable and unbiased synthetic dataset designed with 4 key capability for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. W e develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single object recognition to our new proposed complex 6D spatial reasoning tasks. W e evaluated various large multimodal models (LMMs) on Spatial457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. T o quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.

large language model, machine learning, natural language, (18 more...)

2502.08636

Genre: Research Report (0.50)

Industry: Transportation (0.98)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)