Spatial Reasoning
Two-component spatiotemporal template for activation-inhibition of speech in ECoG
I compute the average trial-by-trial power of band-limited speech activity across epochs of multi-channel high-density electrocorticography (ECoG) recorded from multiple subjects during a consonant-vowel speaking task. I show that previously seen anti-correlations of average beta frequency activity (12-35 Hz) to high-frequency gamma activity (70-140 Hz) during speech movement are observable between individual ECoG channels in the sensorimotor cortex (SMC). With this I fit a variance-based model using principal component analysis to the band-powers of individual channels of session-averaged ECoG data in the SMC and project SMC channels onto their lower-dimensional principal components. Spatiotemporal relationships between speech-related activity and principal components are identified by correlating the principal components of both frequency bands to individual ECoG channels over time using windowed correlation. Correlations of principal component areas to sensorimotor areas reveal a distinct two-component activation-inhibition-like representation for speech that resembles distinct local sensorimotor areas recently shown to have complex interplay in whole-body motor control, inhibition, and posture. Notably the third principal component shows insignificant correlations across all subjects, suggesting two components of ECoG are sufficient to represent SMC activity during speech movement.
MapQaTor: A System for Efficient Annotation of Map Query Datasets
Dihan, Mahir Labib, Ali, Mohammed Eunus, Parvez, Md Rizwan
Mapping and navigation services like Google Maps, Apple Maps, Openstreet Maps, are essential for accessing various location-based data, yet they often struggle to handle natural language geospatial queries. Recent advancements in Large Language Models (LLMs) show promise in question answering (QA), but creating reliable geospatial QA datasets from map services remains challenging. We introduce MapQaTor, a web application that streamlines the creation of reproducible, traceable map-based QA datasets. With its plug-and-play architecture, MapQaTor enables seamless integration with any maps API, allowing users to gather and visualize data from diverse sources with minimal setup. By caching API responses, the platform ensures consistent ground truth, enhancing the reliability of the data even as real-world information evolves. MapQaTor centralizes data retrieval, annotation, and visualization within a single platform, offering a unique opportunity to evaluate the current state of LLM-based geospatial reasoning while advancing their capabilities for improved geospatial understanding. Evaluation metrics show that, MapQaTor speeds up the annotation process by at least 30 times compared to manual methods, underscoring its potential for developing geospatial resources, such as complex map reasoning datasets. The website is live at: https://mapqator.github.io/ and a demo video is available at: https://youtu.be/7_aV9Wmhs6Q.
Efficient Large-Scale Traffic Forecasting with Transformers: A Spatial Data Management Perspective
Fang, Yuchen, Liang, Yuxuan, Hui, Bo, Shao, Zezhi, Deng, Liwei, Liu, Xu, Jiang, Xinke, Zheng, Kai
Road traffic forecasting is crucial in real-world intelligent transportation scenarios like traffic dispatching and path planning in city management and personal traveling. Spatio-temporal graph neural networks (STGNNs) stand out as the mainstream solution in this task. Nevertheless, the quadratic complexity of remarkable dynamic spatial modeling-based STGNNs has become the bottleneck over large-scale traffic data. From the spatial data management perspective, we present a novel Transformer framework called PatchSTG to efficiently and dynamically model spatial dependencies for large-scale traffic forecasting with interpretability and fidelity. Specifically, we design a novel irregular spatial patching to reduce the number of points involved in the dynamic calculation of Transformer. The irregular spatial patching first utilizes the leaf K-dimensional tree (KDTree) to recursively partition irregularly distributed traffic points into leaf nodes with a small capacity, and then merges leaf nodes belonging to the same subtree into occupancy-equaled and non-overlapped patches through padding and backtracking. Based on the patched data, depth and breadth attention are used interchangeably in the encoder to dynamically learn local and global spatial knowledge from points in a patch and points with the same index of patches. Experimental results on four real world large-scale traffic datasets show that our PatchSTG achieves train speed and memory utilization improvements up to $10\times$ and $4\times$ with the state-of-the-art performance.
TrajLearn: Trajectory Prediction Learning using Deep Generative Models
Nadiri, Amirhossein, Li, Jing, Faraji, Ali, Abuoda, Ghadeer, Papagelis, Manos
Trajectory prediction aims to estimate an entity's future path using its current position and historical movement data, benefiting fields like autonomous navigation, robotics, and human movement analytics. Deep learning approaches have become key in this area, utilizing large-scale trajectory datasets to model movement patterns, but face challenges in managing complex spatial dependencies and adapting to dynamic environments. To address these challenges, we introduce TrajLearn, a novel model for trajectory prediction that leverages generative modeling of higher-order mobility flows based on hexagonal spatial representation. TrajLearn predicts the next $k$ steps by integrating a customized beam search for exploring multiple potential paths while maintaining spatial continuity. We conducted a rigorous evaluation of TrajLearn, benchmarking it against leading state-of-the-art approaches and meaningful baselines. The results indicate that TrajLearn achieves significant performance gains, with improvements of up to ~40% across multiple real-world trajectory datasets. In addition, we evaluated different prediction horizons (i.e., various values of $k$), conducted resolution sensitivity analysis, and performed ablation studies to assess the impact of key model components. Furthermore, we developed a novel algorithm to generate mixed-resolution maps by hierarchically subdividing hexagonal regions into finer segments within a specified observation area. This approach supports selective detailing, applying finer resolution to areas of interest or high activity (e.g., urban centers) while using coarser resolution for less significant regions (e.g., rural areas), effectively reducing data storage requirements and computational overhead. We promote reproducibility and adaptability by offering complete code, data, and detailed documentation with flexible configuration options for various applications.
Modeling Continuous Spatial-temporal Dynamics of Turbulent Flow with Test-time Refinement
Chen, Shengyu, Givi, Peyman, Zheng, Can, Jia, Xiaowei
The precise simulation of turbulent flows holds immense significance across various scientific and engineering domains, including climate science, freshwater science, and energy-efficient manufacturing. Within the realm of simulating turbulent flows, large eddy simulation (LES) has emerged as a prevalent alternative to direct numerical simulation (DNS), offering computational efficiency. However, LES cannot accurately capture the full spectrum of turbulent transport scales and is present only at a lower spatial resolution. Reconstructing high-fidelity DNS data from the lower-resolution LES data is essential for numerous applications, but it poses significant challenges to existing super-resolution techniques, primarily due to the complex spatio-temporal nature of turbulent flows. This paper proposes a novel flow reconstruction approach that leverages physical knowledge to model flow dynamics. Different from traditional super-resolution techniques, the proposed approach uses LES data only in the testing phase through a degradation-based refinement approach to enforce physical constraints and mitigate cumulative reconstruction errors over time. Furthermore, a feature sampling strategy is developed to enable flow data reconstruction across different resolutions. The results on two distinct sets of turbulent flow data indicate the effectiveness of the proposed method in reconstructing high-resolution DNS data, preserving the inherent physical attributes of flow transport, and achieving DNS reconstruction at different resolutions.
CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs
Wang, Siyu, Chen, Cailian, Le, Xinyi, Xu, Qimin, Xu, Lei, Zhang, Yanzhou, Yang, Jie
Computer-aided design (CAD) significantly enhances the efficiency, accuracy, and innovation of design processes by enabling precise 2D and 3D modeling, extensive analysis, and optimization. Existing methods for creating CAD models rely on latent vectors or point clouds, which are difficult to obtain and costly to store. Recent advances in Multimodal Large Language Models (MLLMs) have inspired researchers to use natural language instructions and images for CAD model construction. However, these models still struggle with inferring accurate 3D spatial location and orientation, leading to inaccuracies in determining the spatial 3D starting points and extrusion directions for constructing geometries. This work introduces CAD-GPT, a CAD synthesis method with spatial reasoning-enhanced MLLM that takes either a single image or a textual description as input. To achieve precise spatial inference, our approach introduces a 3D Modeling Spatial Mechanism. This method maps 3D spatial positions and 3D sketch plane rotation angles into a 1D linguistic feature space using a specialized spatial unfolding mechanism, while discretizing 2D sketch coordinates into an appropriate planar space to enable precise determination of spatial starting position, sketch orientation, and 2D sketch coordinate translations. Extensive experiments demonstrate that CAD-GPT consistently outperforms existing state-of-the-art methods in CAD model synthesis, both quantitatively and qualitatively.
General Geospatial Inference with a Population Dynamics Foundation Model
Agarwal, Mohit, Sun, Mimi, Kamath, Chaitanya, Muslim, Arbaaz, Sarker, Prithul, Paul, Joydeep, Yee, Hector, Sieniek, Marcin, Jablonski, Kim, Mayer, Yael, Fork, David, de Guia, Sheila, McPike, Jamie, Boulanger, Adam, Shekel, Tomer, Schottlander, David, Xiao, Yao, Manukonda, Manjit Chakravarthy, Liu, Yun, Bulut, Neslihan, Abu-el-haija, Sami, Eigenwillig, Arno, Kothari, Parth, Perozzi, Bryan, Bharel, Monica, Nguyen, Von, Barrington, Luke, Efron, Niv, Matias, Yossi, Corrado, Greg, Eswaran, Krish, Prabhakara, Shruthi, Shetty, Shravya, Prasad, Gautam
Supporting the health and well-being of dynamic populations around the world requires governmental agencies, organizations and researchers to understand and reason over complex relationships between human behavior and local contexts in order to identify high-risk groups and strategically allocate limited resources. Traditional approaches to these classes of problems often entail developing manually curated, task-specific features and models to represent human behavior and the natural and built environment, which can be challenging to adapt to new, or even, related tasks. To address this, we introduce a Population Dynamics Foundation Model (PDFM) that aims to capture the relationships between diverse data modalities and is applicable to a broad range of geospatial tasks. We first construct a geo-indexed dataset for postal codes and counties across the United States, capturing rich aggregated information on human behavior from maps, busyness, and aggregated search trends, and environmental factors such as weather and air quality. We then model this data and the complex relationships between locations using a graph neural network, producing embeddings that can be adapted to a wide range of downstream tasks using relatively simple models. We evaluate the effectiveness of our approach by benchmarking it on 27 downstream tasks spanning three distinct domains: health indicators, socioeconomic factors, and environmental measurements. The approach achieves state-of-the-art performance on all 27 geospatial interpolation tasks, and on 25 out of the 27 extrapolation and super-resolution tasks. We combined the PDFM with a state-of-the-art forecasting foundation model, TimesFM, to predict unemployment and poverty, achieving performance that surpasses fully supervised forecasting. The full set of embeddings and sample code are publicly available for researchers.
SeaMo: A Multi-Seasonal and Multimodal Remote Sensing Foundation Model
Li, Xuyang, Hong, Danfeng, Li, Chenyu, Chanussot, Jocelyn
Remote Sensing (RS) data contains a wealth of multi-dimensional information crucial for Earth observation. Owing to its vast volume, diverse sources, and temporal properties, RS data is highly suitable for the development of large Visual Foundation Models (VFMs). VFMs act as robust feature extractors, learning from extensive RS data, and are subsequently fine-tuned for deployment in various geoscientific tasks. However, current VFMs in the RS domain are predominantly pretrained and tailored exclusively for specific characteristics of RS imagery, neglecting the potential of utilizing the multi-dimensional properties of RS data. Therefore, in this work, we propose SeaMo, a pioneering visual foundation model that integrates multi-seasonal and multimodal information in the RS field. SeaMo is designed to harness multiple properties of RS data. Within the masked image modeling framework, we employ non-aligned cropping techniques to extract spatial properties, use multi-source inputs for multimodal integration, and incorporate temporal-multimodal fusion blocks for effective assimilation of multi-seasonal data. SeaMo explicitly models the multi-dimensional properties of RS data, making the model more comprehensive, robust, and versatile. We applied SeaMo to several downstream geoscience tasks, which demonstrated exceptional performance. Extensive ablation studies were conducted to validate the model's superiority.
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Zheng, Ruijie, Liang, Yongyuan, Huang, Shuaiyi, Gao, Jianfeng, Daumé, Hal III, Kolobov, Andrey, Huang, Furong, Yang, Jianwei
Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to facilitate VLA models' spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision, pretrained on the Open-X-Embodiment and finetuned on our dataset, rivals the 7B OpenVLA baseline while significantly improving inference efficiency.
TravelAgent: Generative Agents in the Built Environment
Noyman, Ariel, Hu, Kai, Larson, Kent
Understanding human behavior in built environments is critical for designing functional, user centered urban spaces. Traditional approaches, such as manual observations, surveys, and simplified simulations, often fail to capture the complexity and dynamics of real world behavior. To address these limitations, we introduce TravelAgent, a novel simulation platform that models pedestrian navigation and activity patterns across diverse indoor and outdoor environments under varying contextual and environmental conditions. TravelAgent leverages generative agents integrated into 3D virtual environments, enabling agents to process multimodal sensory inputs and exhibit human-like decision-making, behavior, and adaptation. Through experiments, including navigation, wayfinding, and free exploration, we analyze data from 100 simulations comprising 1898 agent steps across diverse spatial layouts and agent archetypes, achieving an overall task completion rate of 76%. Using spatial, linguistic, and sentiment analyses, we show how agents perceive, adapt to, or struggle with their surroundings and assigned tasks. Our findings highlight the potential of TravelAgent as a tool for urban design, spatial cognition research, and agent-based modeling. We discuss key challenges and opportunities in deploying generative agents for the evaluation and refinement of spatial designs, proposing TravelAgent as a new paradigm for simulating and understanding human experiences in built environments.