AITopics

Spatial reasoning is an important component of human cognition and an area in which the latest vision-language models (VLMs) show signs of difficulty. Existing analyses rely on image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for evaluating spatial reasoning in VLMs. This platform provides the opportunity for a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at hand, their relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.

large language model, machine learning, natural language

2511.06146

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)

3dSAGER: Geospatial Entity Resolution over 3D Objects (Technical Report)

Genossar, Bar, Dalyot, Sagi, Shraga, Roee, Gal, Avigdor

Urban environments are continuously mapped and modeled by various data collection platforms, including satellites, unmanned aerial vehicles and street cameras. The growing availability of 3D geospatial data from multiple modalities has introduced new opportunities and challenges for integrating spatial knowledge at scale, particularly in high-impact domains such as urban planning and rapid disaster management. Geospatial entity resolution is the task of identifying matching spatial objects across different datasets, often collected independently under varying conditions. Existing approaches typically rely on spatial proximity, textual metadata, or external identifiers to determine correspondence. While useful, these signals are often unavailable, unreliable, or misaligned, especially in cross-source scenarios. To address these limitations, we shift the focus to the intrinsic geometry of 3D spatial objects and present 3dSAGER (3D Spatial-Aware Geospatial Entity Resolution), an end-to-end pipeline for geospatial entity resolution over 3D objects. 3dSAGER introduces a novel, spatial-reference-independent featurization mechanism that captures intricate geometric characteristics of matching pairs, enabling robust comparison even across datasets with incompatible coordinate systems where traditional spatial methods fail. As a key component of 3dSAGER, we also propose a new lightweight and interpretable blocking method, BKAFI, that leverages a trained model to efficiently generate high-recall candidate sets. We validate 3dSAGER through extensive experiments on real-world urban datasets, demonstrating significant gains in both accuracy and efficiency over strong baselines. Our empirical study further dissects the contributions of each component, providing insights into their impact and the overall design choices.

artificial intelligence, machine learning, natural language

doi: 10.1145/3769751

2511.063

Country:

Europe (0.46)
Asia > Middle East > Israel (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (0.34)

Do Street View Imagery and Public Participation GIS align: Comparative Analysis of Urban Attractiveness

Malekzadeh, Milad, Willberg, Elias, Torkko, Jussi, Korpilo, Silviya, Hasanzadeh, Kamyar, Järv, Olle, Toivonen, Tuuli

As digital tools increasingly shape spatial planning practices, understanding how different data sources reflect human experiences of urban environments is essential. Street View Imagery (SVI) and Public Participation GIS (PPGIS) represent two prominent approaches for capturing place-based perceptions that can support urban planning decisions, yet their comparability remains underexplored. This study investigates the alignment between SVI-based perceived attractiveness and residents' reported experiences gathered via a city-wide PPGIS survey in Helsinki, Finland. Using participant-rated SVI data and semantic image segmentation, we trained a machine learning model to predict perceived attractiveness based on visual features. We compared these predictions to PPGIS-identified locations marked as attractive or unattractive, calculating agreement under two sets of criteria, strict and moderate. Our findings reveal only partial alignment between the two datasets. While agreement under the moderate threshold reached 67% for attractive and 77% for unattractive places, agreement under the strict threshold dropped to 27% and 29%, respectively. By analysing a range of contextual variables, including noise, traffic, population presence, and land use, we found that non-visual cues significantly contributed to mismatches. The model failed to account for experiential dimensions such as activity levels and environmental stressors that shape perceptions but are not visible in images. These results suggest that while SVI offers a scalable and visual proxy for urban perception, it cannot fully substitute for the experiential richness captured through PPGIS. We argue that both methods are valuable but serve different purposes; therefore, a more integrated approach is needed to holistically capture how people perceive urban environments.

artificial intelligence, attractiveness, machine learning

2511.0557

Country: Europe > Finland > Uusimaa > Helsinki (0.26)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.68)
Transportation > Infrastructure & Services (0.46)
Transportation > Ground > Rail (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.88)

SIG-Chat: Spatial Intent-Guided Conversational Gesture Generation Involving How, When and Where

Huang, Yiheng, Peng, Junran, Shen, Silei, Yang, Jingwei, Wei, ZeJi, Bai, ChenCheng, He, Yonghao, Sui, Wei, Sun, Muyi, Liu, Yan, Yin, Xu-Cheng, Zhang, Man, Zhang, Zhaoxiang, Luo, Chuanchen

The accompanying actions and gestures in dialogue are often closely linked to interactions with the environment, such as looking toward the interlocutor or using gestures to point to the described target at appropriate moments. Speech and semantics guide the production of gestures by determining their timing (WHEN) and style (HOW), while the spatial locations of interactive objects dictate their directional execution (WHERE). Existing approaches either rely solely on descriptive language to generate motions or utilize audio to produce non-interactive gestures, thereby lacking the characterization of interactive timing and spatial intent. This significantly limits the applicability of conversational gesture generation, whether in robotics or in the fields of game and animation production. To address this gap, we present a full-stack solution. We first established a unique data collection method to simultaneously capture high-precision human motion and spatial intent. We then developed a generation model driven by audio, language, and spatial data, alongside dedicated metrics for evaluating interaction timing and spatial accuracy. Finally, we deployed the solution on a humanoid robot, enabling rich, context-aware physical interactions.

artificial intelligence, machine learning, proceedings

2509.23852

Country: Asia (0.28)

Genre: Research Report (0.50)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Robots > Humanoid Robots (0.66)

GEWDiff: Geometric Enhanced Wavelet-based Diffusion Model for Hyperspectral Image Super-resolution

Wang, Sirui, He, Jiang, Andreo, Natàlia Blasco, Zhu, Xiao Xiang

Improving the quality of hyperspectral images (HSIs), such as through super-resolution, is a crucial research area. However, generative modeling for HSIs presents several challenges. Due to their high spectral dimensionality, HSIs are too memory-intensive for direct input into conventional diffusion models. Furthermore, general generative models lack an understanding of the topological and geometric structures of ground objects in remote sensing imagery. In addition, most diffusion models optimize loss functions at the noise level, leading to a non-intuitive convergence behavior and suboptimal generation quality for complex data. To address these challenges, we propose a Geometric Enhanced Wavelet-based Diffusion Model (GEWDiff), a novel framework for reconstructing hyperspectral images at 4-times super-resolution. A wavelet-based encoder-decoder is introduced that efficiently compresses HSIs into a latent space while preserving spectral-spatial information. To avoid distortion during generation, we incorporate a geometry-enhanced diffusion process that preserves the geometric features. Furthermore, a multi-level loss function was designed to guide the diffusion process, promoting stable convergence and improved reconstruction fidelity. Our model demonstrated state-of-the-art results across multiple dimensions, including fidelity, spectral accuracy, visual realism, and clarity.

artificial intelligence, diffusion model, machine learning

2511.07103

Country:

Europe (0.68)
North America > United States (0.46)

Genre: Research Report > New Finding (0.46)

Industry: Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.35)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.66)

LaneDiffusion: Improving Centerline Graph Learning via Prior Injected BEV Feature Generation

Wang, Zijie, Zhang, Weiming, Zhang, Wei, Tan, Xiao, Liu, Hongxing, Wang, Yaowei, Li, Guanbin

Centerline graphs, crucial for path planning in autonomous driving, are traditionally learned using deterministic methods. However, these methods often lack spatial reasoning and struggle with occluded or invisible centerlines. Generative approaches, despite their potential, remain underexplored in this domain. We introduce LaneDiffusion, a novel generative paradigm for centerline graph learning. LaneDiffusion innovatively employs diffusion models to generate lane centerline priors at the Bird's Eye View (BEV) feature level, instead of directly predicting vectorized centerlines. Our method integrates a Lane Prior Injection Module (LPIM) and a Lane Prior Diffusion Module (LPDM) to effectively construct diffusion targets and manage the diffusion process. Furthermore, vectorized centerlines and topologies are then decoded from these prior-injected BEV features. Extensive evaluations on the nuScenes and Argoverse2 datasets demonstrate that LaneDiffusion significantly outperforms existing methods, achieving improvements of 4.2%, 4.6%, 4.7%, 6.4% and 1.8% on fine-grained point-level metrics (GEO F1, TOPO F1, JTOPO F1, APLS and SDA) and 2.3%, 6.4%, 6.8% and 2.1% on segment-level metrics (IoU, mAP

artificial intelligence, machine learning, spatial reasoning

2511.06272

Country: Asia > China (0.46)

Genre: Research Report (1.00)

Industry: Transportation > Ground > Road (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.66)

ExpReS-VLA: Specializing Vision-Language-Action Models Through Experience Replay and Retrieval

Syed, Shahram Najam, Ahuja, Yatharth, Jakobsson, Arthur, Ichnowski, Jeff

Vision-Language-Action models such as OpenVLA show impressive zero-shot generalization across robotic manipulation tasks but often fail to adapt efficiently to new deployment environments. In many real-world applications, consistent high performance on a limited set of tasks is more important than broad generalization. We propose ExpReS-VLA, a method for specializing pre-trained VLA models through experience replay and retrieval while preventing catastrophic forgetting. ExpReS-VLA stores compact feature representations from the frozen vision backbone instead of raw image-action pairs, reducing memory usage by approximately 97 percent. During deployment, relevant past experiences are retrieved using cosine similarity and used to guide adaptation, while prioritized experience replay emphasizes successful trajectories. We also introduce Thresholded Hybrid Contrastive Loss, which enables learning from both successful and failed attempts. On the LIBERO simulation benchmark, ExpReS-VLA improves success rates from 82.6 to 93.1 percent on spatial reasoning tasks and from 61 to 72.3 percent on long-horizon tasks. On physical robot experiments with five manipulation tasks, it reaches 98 percent success on both seen and unseen settings, compared to 84.7 and 32 percent for naive fine-tuning. Adaptation takes 31 seconds using 12 demonstrations on a single RTX 5090 GPU, making the approach practical for real robot deployment.
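
The retrieval step mentioned in the abstract (past experiences selected by cosine similarity over stored feature vectors) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `retrieve_top_k`, the three-dimensional toy features, and the task labels are all hypothetical, chosen only to show how a cosine-similarity lookup over a small memory of compact features might work.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def retrieve_top_k(query, memory, k=2):
    """Return the k (score, label) pairs most similar to the query.

    `memory` is a list of (label, feature_vector) pairs; in this sketch it
    stands in for a store of compact features (hypothetical layout).
    """
    scored = [(cosine_similarity(query, feat), label) for label, feat in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy memory with made-up features and labels (illustrative only).
memory = [
    ("pick_cup", [1.0, 0.0, 0.0]),
    ("open_drawer", [0.0, 1.0, 0.0]),
    ("pick_mug", [0.9, 0.1, 0.0]),
]
query = [1.0, 0.05, 0.0]
top = retrieve_top_k(query, memory, k=2)
for score, label in top:
    print(f"{label}: {score:.4f}")
```

Because cosine similarity ignores vector magnitude, the ranking depends only on feature direction; whether the paper normalizes features beforehand is not stated in the abstract.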

large language model, machine learning, natural language

2511.06202

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.48)

MARAuder's Map: Motion-Aware Real-time Activity Recognition with Layout-Based Trajectories

Liu, Zishuai, You, Weihang, Lu, Jin, Dou, Fei

Ambient sensor-based human activity recognition (HAR) in smart homes remains challenging due to the need for real-time inference, spatially grounded reasoning, and context-aware temporal modeling. Existing approaches often rely on pre-segmented, within-activity data and overlook the physical layout of the environment, limiting their robustness in continuous, real-world deployments. In this paper, we propose MARAuder's Map, a novel framework for real-time activity recognition from raw, unsegmented sensor streams. Our method projects sensor activations onto the physical floorplan to generate trajectory-aware, image-like sequences that capture the spatial flow of human movement. These representations are processed by a hybrid deep learning model that jointly captures spatial structure and temporal dependencies. To enhance temporal awareness, we introduce a learnable time embedding module that encodes contextual cues such as hour-of-day and day-of-week. Additionally, an attention-based encoder selectively focuses on informative segments within each observation window, enabling accurate recognition even under cross-activity transitions and temporal ambiguity. Extensive experiments on multiple real-world smart home datasets demonstrate that our method outperforms strong baselines, offering a practical solution for real-time HAR in ambient sensor environments.

large language model, machine learning, real time system

2511.05773

Country: North America (0.15)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Smart Houses & Appliances (0.72)
Health & Medicine > Consumer Health (0.46)
Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Architecture > Real Time Systems (1.00)

arXiv.org Artificial Intelligence, Nov-10-2025

iFlyBot-VLM Technical Report

Nie, Xin, Cheng, Zhiyuan, Zhang, Yuan, Ji, Chao, Wu, Jiajia, Zhang, Yuhan, Pan, Jia

We introduce iFlyBot-VLM, a general-purpose Vision-Language Model (VLM) designed to advance Embodied Intelligence. The central objective of iFlyBot-VLM is to bridge the cross-modal semantic gap between high-dimensional environmental perception and low-level robotic motion control. To this end, the model abstracts complex visual and spatial information into a body-agnostic and transferable Operational Language, thereby enabling seamless perception-action closed-loop coordination across diverse robotic platforms. The architecture of iFlyBot-VLM is systematically designed to realize four key functional capabilities essential for embodied intelligence: 1) Spatial Understanding and Metric Reasoning; 2) Interactive Target Grounding; 3) Action Abstraction and Control Parameter Generation; 4) Task Planning and Skill Sequencing. We envision iFlyBot-VLM as a scalable and generalizable foundation model for embodied AI, facilitating the progression from specialized task-oriented systems toward generalist, cognitively capable agents. We conducted evaluations on 10 current mainstream embodied intelligence-related VLM benchmark datasets, such as Blink and Where2Place, and achieved optimal performance while preserving the model's general capabilities. We will publicly release both the training data and model weights to foster further research and development in the field of Embodied Intelligence.

large language model, machine learning, natural language

2511.04976

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Spatio-Temporal Graph Convolutional Networks for EV Charging Demand Forecasting Using Real-World Multi-Modal Data Integration

Tupayachi, Jose, Camur, Mustafa C., Heaslip, Kevin, Li, Xueping

arXiv.org Artificial Intelligence, Nov-10-2025

Transportation remains a major contributor to greenhouse gas emissions, highlighting the urgency of transitioning toward sustainable alternatives such as Electric Vehicles (EVs). Yet, uneven spatial distribution and irregular utilization of charging infrastructure create challenges for both power grid stability and investment planning. This study introduces Traffic-Weather Graph Convolutional Network (TW-GCN), a spatio-temporal forecasting framework that combines Graph Convolutional Networks with temporal architectures to predict EV charging demand in Tennessee, United States. We utilize real-world traffic flows, weather conditions, and proprietary data provided by one of the largest U.S.-based EV infrastructure companies to capture both spatial dependencies and temporal dynamics. Extensive experiments across varying forecasting horizons, clustering strategies, and sequence lengths reveal that mid-horizon (3-hour) forecasts achieve the best balance between responsiveness and stability, with one-dimensional convolutional neural networks consistently outperforming other temporal models. Regional analysis shows disparities in predictive accuracy across East, Middle, and West Tennessee, reflecting how station density, Points of Interest and local demand variability shape model capabilities. The proposed TW-GCN framework advances the integration of data-driven intelligence into EV infrastructure planning while supporting sustainable mobility transitions.
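
For readers unfamiliar with graph convolution, the sketch below shows one layer of the widely used propagation rule ReLU(D^{-1/2} (A + I) D^{-1/2} X W), where A is the adjacency matrix, I adds self-loops, and D is the degree matrix of A + I. The abstract does not specify which GCN variant TW-GCN uses, so this is an assumption for illustration only; the three-station path graph, the single "demand" feature, and the identity weight are hypothetical.

```python
import math

def normalized_adjacency(adj):
    """Symmetrically normalize A + I: D^{-1/2} (A + I) D^{-1/2}."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own signal.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # always >= 1 because of self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain-Python matrix product for small dense matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(adj, features, weights):
    """One graph-convolution layer: ReLU(A_norm @ X @ W)."""
    propagated = matmul(normalized_adjacency(adj), features)
    linear = matmul(propagated, weights)
    return [[max(0.0, v) for v in row] for row in linear]

# Hypothetical example: three charging stations connected in a line (0-1-2),
# one feature per station, and a 1x1 identity weight matrix.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
features = [[1.0], [0.0], [0.0]]
weights = [[1.0]]
out = gcn_layer(adj, features, weights)
print(out)
```

In this toy run, the signal at station 0 spreads to its direct neighbor (station 1) but not to station 2, which is two hops away; stacking layers would widen that receptive field.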

artificial intelligence, data mining, machine learning

2510.09048

Country:

Asia > China (1.00)
North America > United States > Tennessee (0.55)

Genre:

Research Report (1.00)
Overview (0.92)

Industry:

Transportation > Passenger (1.00)
Transportation > Infrastructure & Services (1.00)
Transportation > Ground > Road (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)