Goto

Collaborating Authors

 Spatial Reasoning


Nature's Insight: A Novel Framework and Comprehensive Analysis of Agentic Reasoning Through the Lens of Neuroscience

arXiv.org Artificial Intelligence

Autonomous AI is no longer a hard-to-reach concept, it enables the agents to move beyond executing tasks to independently addressing complex problems, adapting to change while handling the uncertainty of the environment. However, what makes the agents truly autonomous? It is agentic reasoning, that is crucial for foundation models to develop symbolic logic, statistical correlations, or large-scale pattern recognition to process information, draw inferences, and make decisions. However, it remains unclear why and how existing agentic reasoning approaches work, in comparison to biological reasoning, which instead is deeply rooted in neural mechanisms involving hierarchical cognition, multimodal integration, and dynamic interactions. In this work, we propose a novel neuroscience-inspired framework for agentic reasoning. Grounded in three neuroscience-based definitions and supported by mathematical and biological foundations, we propose a unified framework modeling reasoning from perception to action, encompassing four core types, perceptual, dimensional, logical, and interactive, inspired by distinct functional roles observed in the human brain. We apply this framework to systematically classify and analyze existing AI reasoning methods, evaluating their theoretical foundations, computational designs, and practical limitations. We also explore its implications for building more generalizable, cognitively aligned agents in physical and virtual environments. Finally, building on our framework, we outline future directions and propose new neural-inspired reasoning methods, analogous to chain-of-thought prompting. By bridging cognitive neuroscience and AI, this work offers a theoretical foundation and practical roadmap for advancing agentic reasoning in intelligent systems. The associated project can be found at: https://github.com/BioRAILab/Awesome-Neuroscience-Agent-Reasoning .


SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models

arXiv.org Artificial Intelligence

This study introduces SpatialPrompting, a novel framework that harnesses the emergent reasoning capabilities of off-the-shelf multimodal large language models to achieve zero-shot spatial reasoning in three-dimensional (3D) environments. Unlike existing methods that rely on expensive 3D-specific fine-tuning with specialized 3D inputs such as point clouds or voxel-based features, SpatialPrompting employs a keyframe-driven prompt generation strategy. This framework uses metrics such as vision-language similarity, Mahalanobis distance, field of view, and image sharpness to select a diverse and informative set of keyframes from image sequences and then integrates them with corresponding camera pose data to effectively abstract spatial relationships and infer complex 3D structures. The proposed framework not only establishes a new paradigm for flexible spatial reasoning that utilizes intuitive visual and positional cues but also achieves state-of-the-art zero-shot performance on benchmark datasets, such as ScanQA and SQA3D, across several metrics. The proposed method effectively eliminates the need for specialized 3D inputs and fine-tuning, offering a simpler and more scalable alternative to conventional approaches.


FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios

arXiv.org Artificial Intelligence

Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, based on our observations, the denoising process exhibits varying levels of attention to motion (low frequency) and appearance details (high frequency) at different timesteps. So we propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, directly achieves action extraction during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at https://shiyi-zh0408.github.io/projectpages/FlexiAct/


Sightful Spacetop for Windows Review: Spatial Computing Is Here

WIRED

I've been eagerly awaiting the advent of spatial computing. My home office desk setup, with multiple screens and browser windows, helps me be very productive. But on the go, I'm relegated to a laptop's 13-inch screen (or packing a portable monitor), and I'm not as efficient. Spatial computing--usually driven by a mixed reality headset or smart glasses--lets you craft a multi-monitor virtual workspace, where you can place apps and browser windows around your periphery, to replicate the experience you have set up at home or the office. Or you can take it a step further because you're only limited by your imagination.


mwBTFreddy: A Dataset for Flash Flood Damage Assessment in Urban Malawi

arXiv.org Artificial Intelligence

This paper describes the mwBTFreddy dataset, a resource developed to support flash flood damage assessment in urban Malawi, specifically focusing on the impacts of Cyclone Freddy in 2023. The dataset comprises paired pre- and post-disaster satellite images sourced from Google Earth Pro, accompanied by JSON files containing labelled building annotations with geographic coordinates and damage levels (no damage, minor, major, or destroyed). Developed by the Kuyesera AI Lab at the Malawi University of Business and Applied Sciences, this dataset is intended to facilitate the development of machine learning models tailored to building detection and damage classification in African urban contexts. It also supports flood damage visualisation and spatial analysis to inform decisions on relocation, infrastructure planning, and emergency response in climate-vulnerable regions.


Explainable AI in Spatial Analysis

arXiv.org Artificial Intelligence

A key objective in spatial analysis is to model spatial relationships and infer spatial processes to generate knowledge from spatial data, which has been largely based on spatial statistical methods. More recently, machine learning offers scalable and flexible approach es that complement traditional methods and has been increasingly applied in spatial data science . Despite its advantages, machine learning is often criticized for being a black box, which limits our understanding of model behavior and output . Recognizing this limitation, XAI has emerged as a pivotal field in AI that provides methods to explain the output of machine learning models to enhance transparency and understanding. These methods are crucial for model diagnosis, bias detection, and ensuring the reliability of results obtained from machine learning models. This chapter introduces key concepts and methods in XAI with a focus on Shapley value - based approach es, which is arguably the most popular XAI method, and their integration with spatial analysis. An empirical example of county - level voting behaviors in the 2020 Presidential election is presented to demonstrate the use of Shapley values and spatial analysis with a comparison to multi - scale geograp hically weighted regression . The chapter concludes with a discussion on the challenges and limitations of current XAI techniques and proposes new directions .


DeepSTA: A Spatial-Temporal Attention Network for Logistics Delivery Timely Rate Prediction in Anomaly Conditions

arXiv.org Artificial Intelligence

Prediction of couriers' delivery timely rates in advance is essential to the logistics industry, enabling companies to take preemptive measures to ensure the normal operation of delivery services. This becomes even more critical during anomaly conditions like the epidemic outbreak, during which couriers' delivery timely rate will decline markedly and fluctuates significantly. Existing studies pay less attention to the logistics scenario. Moreover, many works focusing on prediction tasks in anomaly scenarios fail to explicitly model abnormal events, e.g., treating external factors equally with other features, resulting in great information loss. Further, since some anomalous events occur infrequently, traditional data-driven methods perform poorly in these scenarios. To deal with them, we propose a deep spatial-temporal attention model, named DeepSTA. To be specific, to avoid information loss, we design an anomaly spatio-temporal learning module that employs a recurrent neural network to model incident information. Additionally, we utilize Node2vec to model correlations between road districts, and adopt graph neural networks and long short-term memory to capture the spatial-temporal dependencies of couriers. To tackle the issue of insufficient training data in abnormal circumstances, we propose an anomaly pattern attention module that adopts a memory network for couriers' anomaly feature patterns storage via attention mechanisms. The experiments on real-world logistics datasets during the COVID-19 outbreak in 2022 show the model outperforms the best baselines by 12.11% in MAE and 13.71% in MSE, demonstrating its superior performance over multiple competitive baselines.


Learning to Estimate Package Delivery Time in Mixed Imbalanced Delivery and Pickup Logistics Services

arXiv.org Artificial Intelligence

Accurately estimating package delivery time is essential to the logistics industry, which enables reasonable work allocation and on-time service guarantee. This becomes even more necessary in mixed logistics scenarios where couriers handle a high volume of delivery and a smaller number of pickup simultaneously. However, most of the related works treat the pickup and delivery patterns on couriers' decision behavior equally, neglecting that the pickup has a greater impact on couriers' decision-making compared to the delivery due to its tighter time constraints. In such context, we have three main challenges: 1) multiple spatiotemporal factors are intricately interconnected, significantly affecting couriers' delivery behavior; 2) pickups have stricter time requirements but are limited in number, making it challenging to model their effects on couriers' delivery process; 3) couriers' spatial mobility patterns are critical determinants of their delivery behavior, but have been insufficiently explored. To deal with these, we propose TransPDT, a Transformer-based multi-task package delivery time prediction model. We first employ the Transformer encoder architecture to capture the spatio-temporal dependencies of couriers' historical travel routes and pending package sets. Then we design the pattern memory to learn the patterns of pickup in the imbalanced dataset via attention mechanism. We also set the route prediction as an auxiliary task of delivery time prediction, and incorporate the prior courier spatial movement regularities in prediction. Extensive experiments on real industry-scale datasets demonstrate the superiority of our method. A system based on TransPDT is deployed internally in JD Logistics to track more than 2000 couriers handling hundreds of thousands of packages per day in Beijing.


Temporal Attention Evolutional Graph Convolutional Network for Multivariate Time Series Forecasting

arXiv.org Artificial Intelligence

Multivariate time series forecasting enables the prediction of future states by leveraging historical data, thereby facilitating decision-making processes. Each data node in a multivariate time series encompasses a sequence of multiple dimensions. These nodes exhibit interdependent relationships, forming a graph structure. While existing prediction methods often assume a fixed graph structure, many real-world scenarios involve dynamic graph structures. Moreover, interactions among time series observed at different time scales vary significantly. To enhance prediction accuracy by capturing precise temporal and spatial features, this paper introduces the Temporal Attention Evolutional Graph Convolutional Network (TAEGCN). This novel method not only integrates causal temporal convolution and a multi-head self-attention mechanism to learn temporal features of nodes, but also construct the dynamic graph structure based on these temporal features to keep the consistency of the changing in spatial feature with temporal series. TAEGCN adeptly captures temporal causal relationships and hidden spatial dependencies within the data. Furthermore, TAEGCN incorporates a unified neural network that seamlessly integrates these components to generate final predictions. Experimental results conducted on two public transportation network datasets, METR-LA and PEMS-BAY, demonstrate the superior performance of the proposed model.


SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data

arXiv.org Artificial Intelligence

Vision-language models (VLMs) work well in tasks ranging from image captioning to visual question answering (VQA), yet they struggle with spatial reasoning, a key skill for understanding our physical world that humans excel at. We find that spatial relations are generally rare in widely used VL datasets, with only a few being well represented, while most form a long tail of underrepresented relations. This gap leaves VLMs ill-equipped to handle diverse spatial relationships. To bridge it, we construct a synthetic VQA dataset focused on spatial reasoning generated from hyper-detailed image descriptions in Localized Narratives, DOCCI, and PixMo-Cap. Our dataset consists of 455k samples containing 3.4 million QA pairs. Trained on this dataset, our Spatial-Reasoning Enhanced (SpaRE) VLMs show strong improvements on spatial reasoning benchmarks, achieving up to a 49% performance gain on the What's Up benchmark, while maintaining strong results on general tasks. Our work narrows the gap between human and VLM spatial reasoning and makes VLMs more capable in real-world tasks such as robotics and navigation.