AITopics

2509.05933

Country:

Europe (0.93)
Asia (0.68)
North America > United States (0.67)

Genre: Research Report (0.82)

Industry: Transportation > Ground > Road (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.87)

arXiv.org Artificial IntelligenceOct-15-2025

mmWave Radar-Based Non-Line-of-Sight Pedestrian Localization at T-Junctions Utilizing Road Layout Extraction via Camera

Park, Byeonggyu, Kim, Hee-Yeun, Choi, Byonghyok, Cho, Hansang, Kim, Byungkwan, Lee, Soomok, Jeon, Mingu, Kim, Seong-Woo

Pedestrians Localization in Non-Line-of-Sight (NLoS) regions within urban environments poses a significant challenge for autonomous driving systems. While mmWave radar has demonstrated potential for detecting objects in such scenarios, the 2D radar point cloud (PCD) data is susceptible to distortions caused by multipath reflections, making accurate spatial inference difficult. Additionally, although camera images provide high-resolution visual information, they lack depth perception and cannot directly observe objects in NLoS regions. In this paper, we propose a novel framework that interprets radar PCD through road layout inferred from camera for localization of NLoS pedestrians. The proposed method leverages visual information from the camera to interpret 2D radar PCD, enabling spatial scene reconstruction. The effectiveness of the proposed approach is validated through experiments conducted using a radar-camera system mounted on a real vehicle. The localization performance is evaluated using a dataset collected in outdoor NLoS driving environments, demonstrating the practical applicability of the method.

artificial intelligence, machine learning, spatial reasoning, (16 more...)

2508.02348

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.46)

Industry:

Transportation > Ground > Road (1.00)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.47)

LSZone: A Lightweight Spatial Information Modeling Architecture for Real-time In-car Multi-zone Speech Separation

Chen, Jun, Hu, Shichao, Lin, Jiuxin, Li, Wenjie, Zhang, Zihan, Li, Xingchen, Liu, JinJiang, Xiao, Longshuai, Weng, Chao, Xie, Lei, Wu, Zhiyong

In-car multi-zone speech separation, which captures voices from different speech zones, plays a crucial role in human-vehicle interaction. Although previous SpatialNet has achieved notable results, its high computational cost still hinders real-time applications in vehicles. To this end, this paper proposes LSZone, a lightweight spatial information modeling architecture for real-time in-car multi-zone speech separation. We design a spatial information extraction-compression (SpaIEC) module that combines Mel spectrogram and Interaural Phase Difference (IPD) to reduce computational burden while maintaining performance. Additionally, to efficiently model spatial information, we introduce an extremely lightweight Conv-GRU crossband-narrowband processing (CNP) module. Experimental results demonstrate that LSZone, with a complexity of 0.56G MACs and a real-time factor (RTF) of 0.37, delivers impressive performance in complex noise and multi-speaker scenarios.

machine learning, natural language, real time system, (19 more...)

2510.10687

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

STRAP: Spatio-Temporal Pattern Retrieval for Out-of-Distribution Generalization

Zhang, Haoyu, Zhang, Wentao, Miao, Hao, Jiang, Xinke, Fang, Yuchen, Zhang, Yifan

Spatio-Temporal Graph Neural Networks (STGNNs) have emerged as a powerful tool for modeling dynamic graph-structured data across diverse domains. However, they often fail to generalize in Spatio-Temporal Out-of-Distribution (STOOD) scenarios, where both temporal dynamics and spatial structures evolve beyond the training distribution. To address this problem, we propose an innovative Spatio-Temporal Retrieval-Augmented Pattern Learning framework,STRAP, which enhances model generalization by integrating retrieval-augmented learning into the STGNN continue learning pipeline. The core of STRAP is a compact and expressive pattern library that stores representative spatio-temporal patterns enriched with historical, structural, and semantic information, which is obtained and optimized during the training phase. During inference, STRAP retrieves relevant patterns from this library based on similarity to the current input and injects them into the model via a plug-and-play prompting mechanism. This not only strengthens spatio-temporal representations but also mitigates catastrophic forgetting. Moreover, STRAP introduces a knowledge-balancing objective to harmonize new information with retrieved knowledge. Extensive experiments across multiple real-world streaming graph datasets show that STRAP consistently outperforms state-of-the-art STGNN baselines on STOOD tasks, demonstrating its robustness, adaptability, and strong generalization capability without task-specific fine-tuning.

data mining, machine learning, natural language, (21 more...)

2505.19547

Country: Asia > China (0.93)

Genre: Research Report (1.00)

Industry: Education > Educational Setting > Continuing Education (0.86)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.93)
(2 more...)

Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph

Wang, Wentao, Zou, Heqing, Luo, Tianze, Huang, Rui, Zhao, Yutian, Wang, Zhuochen, Zhang, Hansheng, Qin, Chengwei, Wang, Yan, Zhao, Lin, Zhang, Huaijian

Recent progress in Multimodal Large Language Models (MLLMs) has demonstrated strong semantic understanding capabilities, but struggles to perform precise spatio-temporal understanding. Existing spatio-temporal methods primarily focus on the video itself, while overlooking the physical information within the video, such as multi-object layouts and motion. Such limitations restrict the use of MLLMs in downstream applications that demand high precision, including embodied intelligence and VR. To address this issue, we present Video-STR, a novel graph-based reinforcement method for precise Video Spatio-Temporal Reasoning. Building upon the capacity of Reinforcement Learning with V erifi-able Reward (RL VR) to improve model abilities, we introduce a reasoning mechanism using graph-based Group Relative Policy Optimization (GRPO) method to guide the model in inferring the underlying spatio-temporal topology of scenarios during the thinking process. To resolve the lack of spatio-temporal training data, we construct the STV-205k dataset with 205k question-answering pairs, covering dynamic multi-object scenes in both indoor and outdoor environments, to support the model training. Experiments show that Video-STR achieves state-of-the-art results on various benchmarks, outperforming the base model by 13% on STI-Bench, and demonstrating the effectiveness of our approach and dataset. Code, model, and data will be released.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

2510.10976

Genre: Research Report (0.50)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Beyond AlphaEarth: Toward Human-Centered Spatial Representation via POI-Guided Contrastive Learning

Liu, Junyuan, Qin, Quan, Dong, Guangsheng, Wang, Xinglei, Feng, Jiazhuang, Zeng, Zichao, Cheng, Tao

General-purpose spatial representations are essential for building transferable geospatial foundation models (GFMs). Among them, the AlphaEarth Foundation (AE) represents a major step toward a global, unified representation of the Earth's surface, learning 10-meter embeddings from multi-source Earth Observation (EO) data that capture rich physical and environmental patterns across diverse landscapes. However, such EO-driven representations remain limited in capturing the functional and socioeconomic dimensions of cities, as they primarily encode physical and spectral patterns rather than human activities or spatial functions. We propose AETHER (AlphaEarth-POI Enriched Representation Learning), a lightweight framework that adapts AlphaEarth to human-centered urban analysis through multimodal alignment guided by Points of Interest (POIs). AETHER aligns AE embeddings with textual representations of POIs, enriching physically grounded EO features with semantic cues about urban functions and socioeconomic contexts. In Greater London, AETHER achieves consistent gains over the AE baseline, with a 7.2% relative improvement in land-use classification F1 and a 23.6% relative reduction in Kullback-Leibler divergence for socioeconomic mapping. Built upon pretrained AE, AETHER leverages a lightweight multimodal alignment to enrich it with human-centered semantics while remaining computationally efficient and scalable for urban applications. By coupling EO with human-centered semantics, it advances geospatial foundation models toward general-purpose urban representations that integrate both physical form and functional meaning. Introduction Understanding the spatial organization and functional dynamics of cities remains a long-standing challenge in GIScience and urban computing. Addressing this challenge requires spatial representations that generalize across scales, modalities, and urban contexts.

artificial intelligence, machine learning, natural language, (19 more...)

2510.09894

Country:

Europe > United Kingdom (0.67)
Asia (0.46)

Genre: Research Report > New Finding (0.46)

Industry:

Energy (0.69)
Health & Medicine (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

OmniSAT: Compact Action Token, Faster Auto Regression

Lyu, Huaihai, Chen, Chaofan, Xie, Senwei, Wang, Pengwei, Chen, Xiansheng, Zhang, Shanghang, Xu, Changsheng

Existing Vision-Language-Action (VLA) models can be broadly categorized into diffusion-based and auto-regressive (AR) approaches: diffusion models capture continuous action distributions but rely on computationally heavy iterative denoising. In contrast, AR models enable efficient optimization and flexible sequence construction, making them better suited for large-scale pretraining. To further improve AR efficiency, particularly when action chunks induce extended and high-dimensional sequences, prior work applies entropy-guided and token-frequency techniques to shorten the sequence length. However, such compression struggled with \textit{poor reconstruction or inefficient compression}. Motivated by this, we introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation. Specifically, we first normalize value ranges and temporal horizons to obtain a consistent representation with B-Spline encoding. Then, we apply multi-stage residual quantization to the position, rotation, and gripper subspaces, producing compressed discrete tokens with coarse-to-fine granularity for each part. After pre-training on the large-scale dataset Droid, the resulting discrete tokenization shortens the training sequence by 6.8$\times$, and lowers the target entropy. To further explore the potential of OmniSAT, we develop a cross-embodiment learning strategy that builds on the unified action-pattern space and jointly leverages robot and human demonstrations. It enables scalable auxiliary supervision from heterogeneous egocentric videos. Across diverse real-robot and simulation experiments, OmniSAT encompasses higher compression while preserving reconstruction quality, enabling faster AR training convergence and model performance.

arxiv preprint arxiv, large language model, machine learning, (20 more...)

2510.09667

Country: Asia (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(2 more...)

arXiv.org Machine LearningOct-14-2025

Population synthesis with geographic coordinates

Lenti, Jacopo, Costantini, Lorenzo, Fosch, Ariadna, Monticelli, Anna, Scala, David, Pangallo, Marco

It is increasingly important to generate synthetic populations with explicit coordinates rather than coarse geographic areas, yet no established methods exist to achieve this. One reason is that latitude and longitude differ from other continuous variables, exhibiting large empty spaces and highly uneven densities. To address this, we propose a population synthesis algorithm that first maps spatial coordinates into a more regular latent space using Normalizing Flows (NF), and then combines them with other features in a Variational Autoencoder (VAE) to generate synthetic populations. This approach also learns the joint distribution between spatial and non-spatial features, exploiting spatial autocorrelations. We demonstrate the method by generating synthetic homes with the same statistical properties of real homes in 121 datasets, corresponding to diverse geographies. We further propose an evaluation framework that measures both spatial accuracy and practical utility, while ensuring privacy preservation. Our results show that the NF+VAE architecture outperforms popular benchmarks, including copula-based methods and uniform allocation within geographic areas. The ability to generate geolocated synthetic populations at fine spatial resolution opens the door to applications requiring detailed geography, from household responses to floods, to epidemic spread, evacuation planning, and transport modeling.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

2510.09669

Country:

Europe > Italy (0.30)
North America > United States (0.28)

Genre: Research Report > New Finding (0.86)

Industry:

Information Technology > Security & Privacy (1.00)
Banking & Finance (1.00)
Health & Medicine (0.68)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(3 more...)

arXiv.org Artificial IntelligenceOct-13-2025

AirScape: An Aerial Generative World Model with Motion Controllability

Zhao, Baining, Tang, Rongze, Jia, Mingyuan, Wang, Ziyou, Man, Fanghang, Zhang, Xin, Shang, Yu, Zhang, Weichen, Wu, Wei, Gao, Chen, Chen, Xinlei, Li, Yong

How to enable agents to predict the outcomes of their own motion intentions in three-dimensional space has been a fundamental problem in embodied intelligence. To explore general spatial imagination capability, we present AirScape, the first world model designed for six-degree-of-freedom aerial agents. AirScape predicts future observation sequences based on current visual inputs and motion intentions. Specifically, we construct a dataset for aerial world model training and testing, which consists of 11k video-intention pairs. This dataset includes first-person-view videos capturing diverse drone actions across a wide range of scenarios, with over 1,000 hours spent annotating the corresponding motion intentions. Then we develop a two-phase schedule to train a foundation model--initially devoid of embodied spatial knowledge--into a world model that is controllable by motion intentions and adheres to physical spatio-temporal constraints. Experimental results demonstrate that AirScape significantly outperforms existing foundation models in 3D spatial imagination capabilities, especially with over a 50% improvement in metrics reflecting motion alignment. The project is available at: https://embodiedcity.github.io/AirScape/.

large language model, machine learning, natural language, (18 more...)

2507.08885

Country:

Europe (0.47)
Asia > China > Guangdong Province (0.15)

Genre: Research Report (0.84)

Industry: Information Technology (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
(2 more...)

arXiv.org Artificial IntelligenceOct-13-2025

Geo-Aware Models for Stream Temperature Prediction across Different Spatial Regions and Scales

Luo, Shiyuan, Yu, Runlong, Chen, Shengyu, Fan, Yingda, Xie, Yiqun, Li, Yanhua, Jia, Xiaowei

Understanding environmental ecosystems is vital for the sustainable management of our planet. However,existing physics-based and data-driven models often fail to generalize to varying spatial regions and scales due to the inherent data heterogeneity presented in real environmental ecosystems. This generalization issue is further exacerbated by the limited observation samples available for model training. To address these issues, we propose Geo-STARS, a geo-aware spatio-temporal modeling framework for predicting stream water temperature across different watersheds and spatial scales. The major innovation of Geo-STARS is the introduction of geo-aware embedding, which leverages geographic information to explicitly capture shared principles and patterns across spatial regions and scales. We further integrate the geo-aware embedding into a gated spatio-temporal graph neural network. This design enables the model to learn complex spatial and temporal patterns guided by geographic and hydrological context, even with sparse or no observational data. We evaluate Geo-STARS's efficacy in predicting stream water temperature, which is a master factor for water quality. Using real-world datasets spanning 37 years across multiple watersheds along the eastern coast of the United States, Geo-STARS demonstrates its superior generalization performance across both regions and scales, outperforming state-of-the-art baselines. These results highlight the promise of Geo-STARS for scalable, data-efficient environmental monitoring and decision-making.

artificial intelligence, machine learning, stream characteristic, (18 more...)

2510.095

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.16)

Genre: Research Report (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Water & Waste Management > Water Management (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.73)