AITopics

2509.26091

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

STOAT: Spatial-Temporal Probabilistic Causal Inference Network

Yang, Yang, Yin, Du, Xue, Hao, Salim, Flora

Spatial-temporal causal time series (STC-TS) involve region-specific temporal observations driven by causally relevant covariates and interconnected across geographic or network-based spaces. Existing methods often model spatial and temporal dynamics independently and overlook causality-driven probabilistic forecasting, limiting their predictive power. To address this, we propose STOAT (Spatial-Temporal Probabilistic Causal Inference Network), a novel framework for probabilistic forecasting in STC-TS. The proposed method extends a causal inference approach by incorporating a spatial relation matrix that encodes interregional dependencies (e.g., proximity or connectivity), enabling spatially informed causal effect estimation. The resulting latent series are processed by deep probabilistic models to estimate the parameters of the distributions, enabling calibrated uncertainty modeling. We further explore multiple output distributions (e.g., Gaussian, Student's-$t$, Laplace) to capture region-specific variability. Experiments on COVID-19 data across six countries demonstrate that STOAT outperforms state-of-the-art probabilistic forecasting models (DeepAR, DeepVAR, Deep State Space Model, etc.) in key metrics, particularly in regions with strong spatial dependencies. By bridging causal inference and geospatial probabilistic forecasting, STOAT offers a generalizable framework for complex spatial-temporal tasks, such as epidemic management.
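The choice among Gaussian, Student's-t, and Laplace output distributions mentioned in the abstract can be illustrated by comparing their negative log-likelihoods on a series. The sketch below is not the STOAT implementation: the function names, the fixed distribution parameters, and the sample series are all hypothetical, and in a real model the parameters would be predicted by a network rather than fixed by hand.

```python
import math


def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under a Normal(mu, sigma) distribution."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)


def laplace_nll(y, mu, b):
    """Negative log-likelihood of y under a Laplace(mu, b) distribution."""
    return math.log(2 * b) + abs(y - mu) / b


def student_t_nll(y, mu, scale, df):
    """Negative log-likelihood of y under a location-scale Student's-t distribution."""
    z = (y - mu) / scale
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
        - math.log(scale)
    )
    return -(log_norm - (df + 1) / 2 * math.log1p(z * z / df))


def mean_nll(nll_fn, observations, *params):
    """Average a per-observation NLL over a series (lower means a better fit)."""
    return sum(nll_fn(y, *params) for y in observations) / len(observations)


# Hypothetical residual series for one region, with a single large outlier.
series = [0.1, -0.2, 0.05, 0.3, 6.0]

# Hand-picked parameters, for illustration only.
scores = {
    "gaussian": mean_nll(gaussian_nll, series, 0.0, 1.0),
    "laplace": mean_nll(laplace_nll, series, 0.0, 1.0),
    "student_t": mean_nll(student_t_nll, series, 0.0, 1.0, 3.0),
}
best_family = min(scores, key=scores.get)
```

The Gaussian penalizes the outlier quadratically, while the Laplace and Student's-t penalties grow more slowly. That difference is one reason a heavy-tailed family may suit regions with occasional spikes.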

artificial intelligence, forecasting, machine learning, (17 more...)

doi: 10.1145/3748636.3762761

2506.09544

Country:

Europe (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.16)
Oceania > Australia > New South Wales (0.14)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Epidemiology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Trends in Motion Prediction Toward Deployable and Generalizable Autonomy: A Revisit and Perspectives

Wang, Letian, Lavoie, Marc-Antoine, Papais, Sandro, Nisar, Barza, Chen, Yuxiao, Ding, Wenhao, Ivanovic, Boris, Shao, Hao, Abuduweili, Abulikemu, Cook, Evan, Zhou, Yang, Karkus, Peter, Li, Jiachen, Liu, Changliu, Pavone, Marco, Waslander, Steven

international intelligent transportation systems conference, large language model, machine learning, (24 more...)

Motion prediction, recently popularized under the term world models, refers to anticipating the future states of agents or the future evolution of a scene, which is rooted in human cognition to bridge perception and decision-making, enabling us to anticipate, adapt, and act within an ever-changing world. It lies at the core of intelligent autonomous systems, such as robotics and self-driving cars, to safely operate in dynamic and human-robot-mixed environments, and also informs broader time-series challenges. With advances in methods, representations, and datasets, the field has seen rapid progress, reflected in rapidly updated benchmark performance. However, when state-of-the-art methods are deployed in the real world, they are often found to struggle to generalize to open-world settings and fall short of deployment standards. This reveals a gap between reality and benchmarks, which are often idealized or ill-posed, and fail to capture real-world complexity. To address the pressing need for problem settings that better reflect real-world challenges and guide future research, this paper focuses on revisiting the generalization and applicability of motion prediction models, with an emphasis on robotics, autonomous driving, and human motion applications. We first provide a comprehensive taxonomy of motion prediction methods, covering representations, modelling methods, application domains, and evaluation protocols. We then revisit two fundamental problems: 1) how to push motion prediction models to be deployable to realistic deployment standards, where motion prediction does not act in a vacuum, but functions as one module of closed-loop autonomy stacks - it takes input from the localization and perception, and informs downstream planning and control.

2505.09074

Country:

Europe (1.00)
Asia (1.00)
North America > United States > California (0.45)
North America > Canada > Ontario > Toronto (0.28)

Genre:

Research Report > Promising Solution (1.00)
Overview (1.00)
Research Report > New Finding (0.92)

Industry:

Transportation > Infrastructure & Services (1.00)
Transportation > Ground > Road (1.00)
Leisure & Entertainment > Games (1.00)
(8 more...)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
(9 more...)

Intuitive Programming, Adaptive Task Planning, and Dynamic Role Allocation in Human-Robot Collaboration

Lagomarsino, Marta, Merlo, Elena, Pupa, Andrea, Birr, Timo, Krebs, Franziska, Secchi, Cristian, Asfour, Tamim, Ajoudani, Arash

Remarkable capabilities have been achieved by robotics and AI, mastering complex tasks and environments. Yet, humans often remain passive observers, fascinated but uncertain how to engage. Robots, in turn, cannot reach their full potential in human-populated environments without effectively modeling human states and intentions and adapting their behavior. To achieve a synergistic human-robot collaboration (HRC), a continuous information flow should be established: humans must intuitively communicate instructions, share expertise, and express needs. In parallel, robots must clearly convey their internal state and forthcoming actions to keep users informed, comfortable, and in control. This review identifies and connects key components enabling intuitive information exchange and skill transfer between humans and robots. We examine the full interaction pipeline: from the human-to-robot communication bridge translating multimodal inputs into robot-understandable representations, through adaptive planning and role allocation, to the control layer and feedback mechanisms to close the loop. Finally, we highlight trends and promising directions toward more adaptive, accessible HRC.

artificial intelligence, machine learning, robot, (20 more...)

doi: 10.1146/annurev-control-022624-013624

2511.08732

Country: Europe (0.93)

Genre: Overview (1.00)

Industry:

Education (1.00)
Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(6 more...)

RS-Net: Context-Aware Relation Scoring for Dynamic Scene Graph Generation

Jo, Hae-Won, Cho, Yeong-Jun

Dynamic Scene Graph Generation (DSGG) models how object relations evolve over time in videos. However, existing methods are trained only on annotated object pairs and lack guidance for non-related pairs, making it difficult to identify meaningful relations during inference. In this paper, we propose Relation Scoring Network (RS-Net), a modular framework that scores the contextual importance of object pairs using both spatial interactions and long-range temporal context. RS-Net consists of a spatial context encoder with learnable context tokens and a temporal encoder that aggregates video-level information. The resulting relation scores are integrated into a unified triplet scoring mechanism to enhance relation prediction. RS-Net can be easily integrated into existing DSGG models without architectural changes. Experiments on the Action Genome dataset show that RS-Net consistently improves both Recall and Precision across diverse baselines, with notable gains in mean Recall, highlighting its ability to address the long-tailed distribution of relations. Despite the increased number of parameters, RS-Net maintains competitive efficiency, achieving superior performance over state-of-the-art methods.

machine learning, natural language, relation, (14 more...)

2511.08651

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries

Shen, Minghe, Zhi, Zhuo, Liu, Chonghan, Xing, Shuo, Tu, Zhengzhong, Liu, Che

While Vision-Language Models (VLMs) post-trained with Reinforcement Learning (RL) show impressive general reasoning, their evaluation is often confined to language-dominant tasks (e.g., math). This raises a critical question: can RL post-training truly extend the inherent capability boundary of a base VLM, particularly for visual-centric spatial tasks where it initially fails? To investigate this, we introduce Ariadne, a framework utilizing synthetic mazes for multi-step spatial reasoning where task difficulty (e.g., path length, turns) is precisely controlled. We leverage this controllable environment to train VLMs using Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum. Surprisingly, post-RLVR training, the VLM achieves over 50% accuracy on a problem set where the base model scored 0%, demonstrating that our approach expands the model's initial capability boundary. To assess real-world viability, we evaluate out-of-distribution (OOD) generalization on practical benchmarks. Despite training only on synthetic maze samples, Ariadne achieves significant zero-shot improvements, averaging 16% on MapBench (e.g., museum navigation) and 24% on ReasonMap (subway transfer tasks). These results confirm that our method not only broadens the model's fundamental limits but also enhances its generalization to real-world spatial reasoning. We acknowledge our study is limited to the post-training phase, given the opaqueness of pre-training data, and hope our research motivates further work on specialized, capability-extending alignment.
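A verified reward for maze tasks of the kind described above can be computed by replaying the proposed moves and checking whether they reach the goal. The following is a minimal sketch under assumptions that do not come from the abstract: the grid encoding (`#` for walls), the move alphabet, the binary reward, and the use of path length as a difficulty proxy are all hypothetical.

```python
# Hypothetical move alphabet: up, down, left, right as (row, column) offsets.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def verified_reward(maze, start, goal, moves):
    """Return 1.0 if the move sequence walks from start to goal legally, else 0.0."""
    row, col = start
    for move in moves:
        if move not in MOVES:
            return 0.0  # malformed answer
        d_row, d_col = MOVES[move]
        row, col = row + d_row, col + d_col
        inside = 0 <= row < len(maze) and 0 <= col < len(maze[0])
        if not inside or maze[row][col] == "#":
            return 0.0  # left the grid or hit a wall
    return 1.0 if (row, col) == goal else 0.0


def curriculum_order(tasks):
    """Sort tasks from easy to hard, using reference path length as the difficulty."""
    return sorted(tasks, key=lambda task: len(task["reference_path"]))


maze = [
    "..#",
    ".#.",
    "...",
]
reward_ok = verified_reward(maze, (0, 0), (2, 2), "DDRR")
reward_wall = verified_reward(maze, (0, 0), (2, 2), "RR")
```

Because the reward comes from replaying the moves rather than from a learned judge, it cannot be gamed by plausible-looking but invalid paths.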

large language model, machine learning, reinforcement learning, (19 more...)

2511.0071

Country: North America > United States > California > Los Angeles County > Los Angeles (0.14)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.56)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)

Large Pre-Trained Models for Bimanual Manipulation in 3D

Yurchyk, Hanna, Chang, Wei-Di, Dudek, Gregory, Meger, David

We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.
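The step of lifting pixel-level saliency into a voxel grid can be sketched with a simple back-projection. This is an illustrative sketch and not the authors' pipeline: it assumes a single pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a per-pixel depth image, points expressed in the camera frame, and mean aggregation per voxel. None of these are stated in the abstract.

```python
from collections import defaultdict


def lift_saliency_to_voxels(saliency, depth, fx, fy, cx, cy,
                            origin, voxel_size, grid_shape):
    """Average per-pixel saliency scores into the voxels their 3D points fall in.

    saliency and depth are row-major 2D lists of equal shape. Pixels with
    non-positive depth are treated as invalid and skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, (saliency_row, depth_row) in enumerate(zip(saliency, depth)):
        for u, (score, d) in enumerate(zip(saliency_row, depth_row)):
            if d <= 0:
                continue
            # Pinhole back-projection of pixel (u, v) at depth d.
            point = ((u - cx) * d / fx, (v - cy) * d / fy, d)
            index = tuple(int((p - o) // voxel_size) for p, o in zip(point, origin))
            if all(0 <= i < n for i, n in zip(index, grid_shape)):
                sums[index] += score
                counts[index] += 1
    return {index: sums[index] / counts[index] for index in sums}


# Toy 2x2 image; the bottom-left pixel has no valid depth.
saliency = [[0.2, 0.8], [0.0, 1.0]]
depth = [[1.0, 1.0], [0.0, 1.0]]
voxel_saliency = lift_saliency_to_voxels(
    saliency, depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0,
    origin=(0.0, 0.0, 0.0), voxel_size=2.0, grid_shape=(2, 2, 2),
)
```

The resulting dictionary maps voxel indices to a mean saliency value that could be appended as an extra feature channel to a voxel-based policy's input.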

artificial intelligence, machine learning, natural language, (17 more...)

doi: 10.1109/Humanoids65713.2025.11203079

2509.20579

Country: North America > United States (0.46)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(3 more...)

arXiv.org Artificial Intelligence, Nov-12-2025

Capturing Complex Spatial-Temporal Dependencies in Traffic Forecasting: A Self-Attention Approach

Chenghong, Zheng, Deng, Zongyin, Cheng, Liu, Simin, Xiong, Deshi, Di, Guanyao, Li

We study the problem of traffic forecasting, aiming to predict the inflow and outflow of a region in the subsequent time slot. The problem is complex due to the intricate spatial and temporal interdependence among regions. Prior works study the spatial and temporal dependency in a decoupled manner, failing to capture their joint effect. In this work, we propose ST-SAM, a novel and efficient Spatial-Temporal Self-Attention Model for traffic forecasting. ST-SAM uses a region embedding layer to learn time-specific embeddings from traffic data for regions. Then, it employs a spatial-temporal dependency learning module based on a self-attention mechanism to capture the joint spatial-temporal dependency for both nearby and faraway regions. ST-SAM entirely relies on self-attention to capture both local and global spatial-temporal correlations, which makes it effective and efficient. Extensive experiments on two real-world datasets show that ST-SAM is substantially more accurate and efficient than the state-of-the-art approaches (with an average improvement of up to 15% on RMSE, 17% on MAPE, and 32 times on training time in our experiments).
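The idea of capturing joint spatial-temporal dependency with self-attention can be illustrated with plain scaled dot-product attention over a set of embeddings, one per (region, time slot) pair. This is a generic single-head sketch without learned projections, not the ST-SAM architecture. Treating each (region, time slot) pair as one token is an assumption made here for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention with queries = keys = values.

    Returns the attended outputs and the attention weight matrix, where
    weights[i][j] is how much token i attends to token j.
    """
    dim = len(embeddings[0])
    weights = []
    for query in embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[j] for w, value in zip(row, embeddings)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical tokens: two regions at two time slots, flattened into one sequence
# so that every token can attend to any region at any time.
tokens = [
    [1.0, 0.0],  # region A, slot t-1
    [0.9, 0.1],  # region A, slot t
    [0.0, 1.0],  # region B, slot t-1
    [0.1, 0.9],  # region B, slot t
]
attended, attention = self_attention(tokens)
```

Flattening space and time into one token sequence lets a single attention step relate a region to distant regions at earlier time slots, instead of handling the two axes in separate stages.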

artificial intelligence, dependency, machine learning, (19 more...)

2511.0798

Country: Asia > China > Guangdong Province (0.15)

Genre: Research Report > New Finding (0.67)

Industry: Transportation (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence, Nov-12-2025

ViPRA: Video Prediction for Robot Actions

Routray, Sandeep, Pan, Hengkai, Jain, Unnat, Bahl, Shikhar, Pathak, Deepak

Can we turn a video prediction model into a robot policy? Videos, including those of humans or teleoperated robots, capture rich physical interactions. However, most of them lack labeled actions, which limits their use in robot learning. We present Video Prediction for Robot Actions (ViPRA), a simple pretraining-finetuning framework that learns continuous robot control from these actionless videos. Instead of directly predicting actions, we train a video-language model to predict both future visual observations and motion-centric latent actions, which serve as intermediate representations of scene dynamics. We train these latent actions using perceptual losses and optical flow consistency to ensure they reflect physically grounded behavior. For downstream control, we introduce a chunked flow matching decoder that maps latent actions to robot-specific continuous action sequences, using only 100 to 200 teleoperated demonstrations. This approach avoids expensive action annotation, supports generalization across embodiments, and enables smooth, high-frequency continuous control up to 22 Hz via chunked action decoding. Unlike prior latent action works that treat pretraining as autoregressive policy learning, ViPRA explicitly models both what changes and how. Our method outperforms strong baselines, with a 16% gain on the SIMPLER benchmark and a 13% improvement across real-world manipulation tasks. We will release models and code at https://vipra-project.github.io

artificial intelligence, arxiv e-print, spatial reasoning, (18 more...)

2511.07732

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.45)

arXiv.org Artificial Intelligence, Nov-12-2025

How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

Yu, Songsong, Chen, Yuxin, Ju, Hao, Jia, Lianjie, Zhang, Fuxi, Huang, Shaofei, Wu, Yuhan, Cui, Rundi, Ran, Binghao, Zhang, Zaibin, Zheng, Zhedong, Zhang, Zhipeng, Wang, Yifan, Song, Lin, Wang, Lijun, Li, Yanwei, Shan, Ying, Lu, Huchuan

Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, encompassing a review of existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning, as models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. The related resources of this study are accessible at https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/.

large language model, machine learning, natural language, (19 more...)

2509.18905

Country: Asia > China (0.46)

Genre:

Research Report (1.00)
Overview (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)