AITopics

2510.12276

Country: Asia > China (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
(2 more...)

arXiv.org Artificial IntelligenceOct-17-2025

SUM-AgriVLN: Spatial Understanding Memory for Agricultural Vision-and-Language Navigation

Zhao, Xiaobei, Lyu, Xingqi, Li, Xiang

Agricultural robots are emerging as powerful assistants across a wide range of agricultural tasks, nevertheless, still heavily rely on manual operation or fixed rail systems for movement. The AgriVLN method and the A2A benchmark pioneeringly extend Vision-and-Language Navigation (VLN) to the agricultural domain, enabling robots to navigate to the target positions following the natural language instructions. In practical agricultural scenarios, navigation instructions often repeatedly occur, yet AgriVLN treat each instruction as an independent episode, overlooking the potential of past experiences to provide spatial context for subsequent ones. To bridge this gap, we propose the method of Spatial Understanding Memory for Agricultural Vision-and-Language Navigation (SUM-AgriVLN), in which the SUM module employs spatial understanding and save spatial memory through 3D reconstruction and representation. When evaluated on the A2A benchmark, our SUM-AgriVLN effectively improves Success Rate from 0.47 to 0.54 with slight sacrifice on Navigation Error from 2.91m to 2.93m, demonstrating the state-of-the-art performance in the agricultural domain. Code: https://github.com/AlexTraveling/SUM-AgriVLN.

artificial intelligence, natural language, spatial reasoning, (18 more...)

2510.14357

Genre: Research Report (0.82)

Industry: Transportation > Ground > Rail (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.66)

BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals

Xiao, Qinfan, Cui, Ziyun, Zhang, Chi, Chen, Siqi, Wu, Wen, Thwaites, Andrew, Woolgar, Alexandra, Zhou, Bowen, Zhang, Chao

Electroencephalography (EEG) and magnetoencephalography (MEG) measure neural activity non-invasively by capturing electromagnetic fields generated by dendritic currents. Although rooted in the same biophysics, EEG and MEG exhibit distinct signal patterns, further complicated by variations in sensor configurations across modalities and recording devices. Existing approaches typically rely on separate, modality- and dataset-specific models, which limits the performance and cross-domain scalability. This paper proposes BrainOmni, the first brain foundation model that generalises across heterogeneous EEG and MEG recordings. To unify diverse data sources, we introduce BrainTokenizer,the first tokenizer that quantises spatiotemporal brain activity into discrete representations. Central to BrainTokenizer is a novel Sensor Encoder that encodes sensor properties such as spatial layout, orientation, and type, enabling compatibility across devices and modalities. Building upon the discrete representations, BrainOmni learns unified semantic embeddings of brain signals by self-supervised pretraining. To the best of our knowledge, it is the first foundation model to support both EEG and MEG signals, as well as the first to incorporate large-scale MEG pretraining. A total of 1,997 hours of EEG and 656 hours of MEG data are curated and standardised from publicly available sources for pretraining. Experiments show that BrainOmni outperforms both existing foundation models and state-of-the-art task-specific models on a range of downstream tasks. It also demonstrates strong generalisation to unseen EEG and MEG devices. Further analysis reveals that joint EEG-MEG (EMEG) training yields consistent improvements across both modalities. Code and checkpoints are publicly available at https://github.com/OpenTSLab/BrainOmni.

artificial intelligence, machine learning, natural language, (20 more...)

2505.18185

Country:

Asia (0.28)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

Chen, Xinyi, Chen, Yilun, Fu, Yanwei, Gao, Ning, Jia, Jiaya, Jin, Weiyang, Li, Hao, Mu, Yao, Pang, Jiangmiao, Qiao, Yu, Tian, Yang, Wang, Bin, Wang, Bolun, Wang, Fangjing, Wang, Hanqing, Wang, Tai, Wang, Ziqin, Wei, Xueyuan, Wu, Chao, Yang, Shuai, Ye, Jinhui, Yu, Junqiu, Zeng, Jia, Zhang, Jingjing, Zhang, Jinyu, Zhang, Shi, Zheng, Feng, Zhou, Bowen, Zhu, Yangkun

We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine ``where to act'' by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide ``how to act'' by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots. Code and models are available at https://github.com/InternRobotics/InternVLA-M1.

arxiv preprint arxiv, large language model, natural language, (18 more...)

2510.13778

Genre: Research Report (0.41)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.54)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations

Mühlematter, Dominik J., Che, Lin, Hong, Ye, Raubal, Martin, Wiedemann, Nina

Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize task-specific models, while recent foundation models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a Geo-Foundation Model (GeoFM) that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POIs) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion's strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior foundation models on location-encoding, 2) allows multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios. All source code is available at https://github.com/DominikM198/UrbanFusion.

large language model, machine learning, natural language, (17 more...)

2510.13774

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre: Research Report > New Finding (0.92)

Industry:

Banking & Finance > Real Estate (0.66)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.47)
Transportation > Ground > Road (0.45)
Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.87)

Lau, Lap Chi, Ramachandran, Akshay

Optimal Bounds for Tyler's M-Estimator for Elliptical Distributions

A fundamental problem in statistics is estimating the shape matrix of an Elliptical distribution. This generalizes the familiar problem of Gaussian covariance estimation, for which the sample covariance achieves optimal estimation error. For Elliptical distributions, Tyler proposed a natural M-estimator and showed strong statistical properties in the asymptotic regime, independent of the underlying distribution. Numerical experiments show that this estimator performs very well, and that Tyler's iterative procedure converges quickly to the estimator. Franks and Moitra recently provided the first distribution-free error bounds in the finite sample setting, as well as the first rigorous convergence analysis of Tyler's iterative procedure. However, their results exceed the sample complexity of the Gaussian setting by a $\log^{2} d$ factor. We close this gap by proving optimal sample threshold and error bounds for Tyler's M-estimator for all Elliptical distributions, fully matching the Gaussian result. Moreover, we recover the algorithmic convergence even at this lower sample threshold. Our approach builds on the operator scaling connection of Franks and Moitra by introducing a novel pseudorandom condition, which we call $\infty$-expansion. We show that Elliptical distributions satisfy $\infty$-expansion at the optimal sample threshold, and then prove a novel scaling result for inputs satisfying this condition.

artificial intelligence, machine learning, theorem 2, (15 more...)

2510.13751

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.35)

Rethinking Graph Domain Adaptation: A Spectral Contrastive Perspective

Zhang, Haoyu, Cheng, Yuxuan, Fan, Wenqi, Chen, Yulong, Zhang, Yifan

Graph neural networks (GNNs) have achieved remarkable success in various domains, yet they often struggle with domain adaptation due to significant structural distribution shifts and insufficient exploration of transferable patterns. One of the main reasons behind this is that traditional approaches do not treat global and local patterns discriminatingly so that some local details in the graph may be violated after multi-layer GNN. Our key insight is that domain shifts can be better understood through spectral analysis, where low-frequency components often encode domain-invariant global patterns, and high-frequency components capture domain-specific local details. As such, we propose FracNet (Fr equency A ware C ontrastive Graph Net work) with two synergic modules to decompose the original graph into high-frequency and low-frequency components and perform frequency-aware domain adaption. Moreover, the blurring boundary problem of domain adaptation is improved by integrating with a contrastive learning framework. Besides the practical implication, we also provide rigorous theoretical proof to demonstrate the superiority of FracNet. Extensive experiments further demonstrate significant improvements over state-of-the-art approaches.

artificial intelligence, machine learning, spatial reasoning, (12 more...)

doi: 10.1007/978-3-032-06106-5_26

2510.13254

Country: Asia > China (0.29)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.93)
Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (0.46)
Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.46)

EO-1: Interleaved Vision-Text-Action Pretraining for General Robot Control

Qu, Delin, Song, Haoming, Chen, Qizhi, Chen, Zhaoqing, Gao, Xianqiang, Ye, Xinyi, Lv, Qi, Shi, Modi, Ren, Guanghui, Ruan, Cheng, Yao, Maoqing, Yang, Haoran, Bao, Jiacheng, Zhao, Bin, Wang, Dong

The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, introduce EO-Robotics, consists of EO-1 model and EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.

artificial intelligence, large language model, natural language, (17 more...)

2508.21112

Country: Asia (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Amorosa, Lorenzo Mario, Conti, Francesco, Quercioli, Nicola, Zabini, Flavio, Mahyari, Tayebeh Lotfi, Ge, Yiqun, Frosini, Patrizio

Reconstruction of SINR Maps from Sparse Measurements using Group Equivariant Non-Expansive Operators

arXiv.org Artificial IntelligenceOct-15-2025

As sixth generation (6G) wireless networks evolve, accurate signal-to-interference-noise ratio (SINR) maps are becoming increasingly critical for effective resource management and optimization. However, acquiring such maps at high resolution is often cost-prohibitive, creating a severe data scarcity challenge. This necessitates machine learning (ML) approaches capable of robustly reconstructing the full map from extremely sparse measurements. To address this, we introduce a novel reconstruction framework based on Group Equivariant Non-Expansive Operators (GENEOs). Unlike data-hungry ML models, GENEOs are low-complexity operators that embed domain-specific geometric priors, such as translation invariance, directly into their structure. This provides a strong inductive bias, enabling effective reconstruction from very few samples. Our key insight is that for network management, preserving the topological structure of the SINR map, such as the geometry of coverage holes and interference patterns, is often more critical than minimizing pixel-wise error. We validate our approach on realistic ray-tracing-based urban scenarios, evaluating performance with both traditional statistical metrics (mean squared error (MSE)) and, crucially, a topological metric (1-Wasserstein distance). Results show that while maintaining competitive MSE, our method dramatically outperforms established ML baselines in topological fidelity. This demonstrates the practical advantage of GENEOs for creating structurally accurate SINR maps that are more reliable for downstream network optimization tasks.

artificial intelligence, machine learning, spatial reasoning, (19 more...)

2507.19349

Country:

Europe (1.00)
North America > United States (0.93)

Genre: Research Report > New Finding (0.34)

Industry:

Telecommunications (0.46)
Energy (0.34)

Technology:

Information Technology > Communications > Networks > Sensor Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.66)

Sabir, Muhammad Ayub, Pang, Junbiao, Wu, Jiaqi, Ashraf, Fatima

Few Shot Semi-Supervised Learning for Abnormal Stop Detection from Sparse GPS Trajectories

arXiv.org Artificial IntelligenceOct-15-2025

Abnormal stop detection (ASD) in intercity coach transportation is critical for ensuring passenger safety, operational reliability, and regulatory compliance. However, two key challenges hinder ASD effectiveness: sparse GPS trajectories, which obscure short or unauthorized stops, and limited labeled data, which restricts supervised learning. Existing methods often assume dense sampling or regular movement patterns, limiting their applicability. To address data sparsity, we propose a Sparsity-Aware Segmentation (SAS) method that adaptively defines segment boundaries based on local spatial-temporal density. Building upon these segments, we introduce three domain-specific indicators to capture abnormal stop behaviors. To further mitigate the impact of sparsity, we develop Locally Temporal-Indicator Guided Adjustment (LTIGA), which smooths these indicators via local similarity graphs. To overcome label scarcity, we construct a spatial-temporal graph where each segment is a node with LTIGA-refined features. We apply label propagation to expand weak supervision across the graph, followed by a GCN to learn relational patterns. A final self-training module incorporates high-confidence pseudo-labels to iteratively improve predictions. Experiments on real-world coach data show an AUC of 0.854 and AP of 0.866 using only 10 labeled instances, outperforming prior methods. The code and dataset are publicly available at \href{https://github.com/pangjunbiao/Abnormal-Stop-Detection-SSL.git}

artificial intelligence, detection, machine learning, (17 more...)

2510.12686

Genre:

Research Report (0.64)
Instructional Material (0.54)

Industry: Transportation > Passenger (0.88)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)