AITopics | Spatial Reasoning

Collaborating Authors

Spatial Reasoning

News Overviews Instructional Materials AI-Alerts Classics

Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech

Liu, Rui, He, Shuwei, Hu, Yifan, Li, Haizhou

arXiv.org Artificial IntelligenceJan-14-2025

Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt to synthesize the reverberant speech for the spoken content. The challenge of this task lies in understanding the spatial environment from the image. Many attempts have been made to extract global spatial visual information from the RGB space of an spatial image. However, local and depth image information are crucial for understanding the spatial environment, which previous works have ignored. To address the issues, we propose a novel multi-modal and multi-scale spatial environment understanding scheme to achieve immersive VTTS, termed M2SE-VTTS. The multi-modal aims to take both the RGB and Depth spaces of the spatial image to learn more comprehensive spatial information, and the multi-scale seeks to model the local and global spatial knowledge simultaneously. Specifically, we first split the RGB and Depth images into patches and adopt the Gemini-generated environment captions to guide the local spatial understanding. After that, the multi-modal and multi-scale features are integrated by the local-aware global spatial understanding. In this way, M2SE-VTTS effectively models the interactions between local and global spatial contexts in the multi-modal spatial environment. Objective and subjective evaluations suggest that our model outperforms the advanced baselines in environmental speech generation. The code and audio samples are available at: https://github.com/AI-S2-Lab/M2SE-VTTS.

information, proceedings, spatial environment, (15 more...)

arXiv.org Artificial Intelligence

2412.11409

Country:

Asia > China > Guangdong Province > Shenzhen (0.05)
Asia > Mongolia (0.04)
Asia > China > Inner Mongolia (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
(2 more...)

Add feedback

Detecting Contextual Anomalies by Discovering Consistent Spatial Regions

Yang, Zhengye, Radke, Richard J.

arXiv.org Artificial IntelligenceJan-14-2025

We describe a method for modeling spatial context to enable video anomaly detection. The main idea is to discover regions that share similar object-level activities by clustering joint object attributes using Gaussian mixture models. We demonstrate that this straightforward approach, using orders of magnitude fewer parameters than competing models, achieves state-of-the-art performance in the challenging spatial-context-dependent Street Scene dataset. As a side benefit, the high-resolution discovered regions learned by the model also provide explainable normalcy maps for human operators without the need for any pre-trained segmentation model.

anomaly, data mining, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2501.0847

Country: North America > United States (1.00)

Genre: Research Report (0.50)

Industry: Government > Regional Government (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.87)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.78)
(2 more...)

Add feedback

BiDepth Multimodal Neural Network: Bidirectional Depth Deep Learning Arcitecture for Spatial-Temporal Prediction

Ehsani, Sina, Pan, Fenglian, Hu, Qingpei, Liu, Jian

arXiv.org Artificial IntelligenceJan-14-2025

Accurate prediction of spatial-temporal (ST) information in dynamic systems, such as urban mobility and weather patterns, is a crucial yet challenging problem. The complexity stems from the intricate interplay between spatial proximity and temporal relevance, where both long-term trends and short-term fluctuations are present in convoluted patterns. Existing approaches, including traditional statistical methods and conventional neural networks, may provide inaccurate results due to the lack of an effective mechanism that simultaneously incorporates information at variable temporal depths while maintaining spatial context, resulting in a trade-off between comprehensive long-term historical analysis and responsiveness to short-term new information. To bridge this gap, this paper proposes the BiDepth Multimodal Neural Network (BDMNN) with bidirectional depth modulation that enables a comprehensive understanding of both long-term seasonality and short-term fluctuations, adapting to the complex ST context. Case studies with real-world public data demonstrate significant improvements in prediction accuracy, with a 12% reduction in Mean Squared Error for urban traffic prediction and a 15% improvement in rain precipitation forecasting compared to state-of-the-art benchmarks, without demanding extra computational resources.

artificial intelligence, correlation, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2501.08411

Country: North America > United States > Arizona (0.14)

Genre: Research Report (1.00)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)
Health & Medicine (1.00)
Energy (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

CureGraph: Contrastive Multi-Modal Graph Representation Learning for Urban Living Circle Health Profiling and Prediction

Li, Jinlin, Zhou, Xiao

arXiv.org Artificial IntelligenceJan-13-2025

The early detection and prediction of health status decline among the elderly at the neighborhood level are of great significance for urban planning and public health policymaking. While existing studies affirm the connection between living environments and health outcomes, most rely on single data modalities or simplistic feature concatenation of multi-modal information, limiting their ability to comprehensively profile the health-oriented urban environments. To fill this gap, we propose CureGraph, a contrastive multi-modal representation learning framework for urban health prediction that employs graph-based techniques to infer the prevalence of common chronic diseases among the elderly within the urban living circles of each neighborhood. CureGraph leverages rich multi-modal information, including photos and textual reviews of residential areas and their surrounding points of interest, to generate urban neighborhood embeddings. By integrating pre-trained visual and textual encoders with graph modeling techniques, CureGraph captures cross-modal spatial dependencies, offering a comprehensive understanding of urban environments tailored to elderly health considerations. Extensive experiments on real-world datasets demonstrate that CureGraph improves the best baseline by $28\%$ on average in terms of $R^2$ across elderly disease risk prediction tasks. Moreover, the model enables the identification of stage-wise chronic disease progression and supports comparative public health analysis across neighborhoods, offering actionable insights for sustainable urban development and enhanced quality of life. The code is publicly available at https://github.com/jinlin2021/CureGraph.

data mining, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2501.07157

Country: Asia > China (0.72)

Genre:

Overview (0.92)
Research Report > New Finding (0.67)
Research Report > Experimental Study (0.45)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Consumer Health (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
(3 more...)

Add feedback

Encoded Spatial Attribute in Multi-Tier Federated Learning

Kawnine, Asfia, Palma, Francis, Azghadi, Seyed Alireza Rahimi, Cao, Hung

arXiv.org Artificial IntelligenceJan-10-2025

This research presents an Encoded Spatial Multi-Tier Federated Learning approach for a comprehensive evaluation of aggregated models for geospatial data. In the client tier, encoding spatial information is introduced to better predict the target outcome. The research aims to assess the performance of these models across diverse datasets and spatial attributes, highlighting variations in predictive accuracy. Using evaluation metrics such as accuracy, our research reveals insights into the complexities of spatial granularity and the challenges of capturing underlying patterns in the data. We extended the scope of federated learning (FL) by having multi-tier along with the functionality of encoding spatial attributes. Our N-tier FL approach used encoded spatial data to aggregate in different tiers. We obtained multiple models that predicted the different granularities of spatial data. Our findings underscore the need for further research to improve predictive accuracy and model generalization, with potential avenues including incorporating additional features, refining model architectures, and exploring alternative modeling approaches. Our experiments have several tiers representing different levels of spatial aspects. We obtained accuracy of 75.62% and 89.52% for the global model without having to train the model using the data constituted with the designated tier. The research also highlights the importance of the proposed approach in real-time applications.

artificial intelligence, machine learning, spatial reasoning, (16 more...)

arXiv.org Artificial Intelligence

2501.05934

Country: North America > Canada > New Brunswick (0.14)

Genre: Research Report > New Finding (0.86)

Industry:

Transportation > Ground > Road (0.94)
Transportation > Electric Vehicle (0.94)
Information Technology (0.94)
Automobiles & Trucks (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.55)

Add feedback

Multi-class Decoding of Attended Speaker Direction Using Electroencephalogram and Audio Spatial Spectrum

Zhang, Yuanming, Lu, Jing, Chen, Fei, Du, Haoliang, Gao, Xia, Lin, Zhibin

arXiv.org Artificial IntelligenceJan-9-2025

Decoding the directional focus of an attended speaker from listeners' electroencephalogram (EEG) signals is essential for developing brain-computer interfaces to improve the quality of life for individuals with hearing impairment. Previous works have concentrated on binary directional focus decoding, i.e., determining whether the attended speaker is on the left or right side of the listener. However, a more precise decoding of the exact direction of the attended speaker is necessary for effective speech processing. Additionally, audio spatial information has not been effectively leveraged, resulting in suboptimal decoding results. In this paper, it is found that on the recently presented dataset with 14-class directional focus, models relying exclusively on EEG inputs exhibit significantly lower accuracy when decoding the directional focus in both leave-one-subject-out and leave-one-trial-out scenarios. By integrating audio spatial spectra with EEG features, the decoding accuracy can be effectively improved. The CNN, LSM-CNN, and Deformer models are employed to decode the directional focus from listeners' EEG signals and audio spatial spectra. The proposed Sp-EEG-Deformer model achieves notable 14-class decoding accuracies of 55.35% and 57.19% in leave-one-subject-out and leave-one-trial-out scenarios with a decision window of 1 second, respectively. Experiment results indicate increased decoding accuracy as the number of alternative directions reduces. These findings suggest the efficacy of our proposed dual modal directional focus decoding strategy.

accuracy, dataset, directional focus, (15 more...)

arXiv.org Artificial Intelligence

2411.06928

Country:

Asia > China > Jiangsu Province > Nanjing (0.05)
Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.88)

Industry: Health & Medicine > Therapeutic Area (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.67)

Add feedback

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Yuan, Yuqian, Zhang, Hang, Li, Wentong, Cheng, Zesen, Zhang, Boqiang, Li, Long, Li, Xin, Zhao, Deli, Zhang, Wenqiao, Zhuang, Yueting, Zhu, Jianke, Bing, Lidong

arXiv.org Artificial IntelligenceJan-8-2025

Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. Besides, the lack of high-quality object-level video instruction data and a comprehensive benchmark further hinders their advancements. To tackle these challenges, we introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding, i.e., enabling perception and reasoning on any objects throughout the video. Specially, we thoroughly develop VideoRefer Suite across three essential aspects: dataset, model, and benchmark. Firstly, we introduce a multi-agent data engine to meticulously curate a large-scale, high-quality object-level video instruction dataset, termed VideoRefer-700K. Next, we present the VideoRefer model, which equips a versatile spatial-temporal object encoder to capture precise regional and sequential representations. Finally, we meticulously create a VideoRefer-Bench to comprehensively assess the spatial-temporal understanding capability of a Video LLM, evaluating it across various aspects. Extensive experiments and analyses demonstrate that our VideoRefer model not only achieves promising performance on video referring benchmarks but also facilitates general video understanding capabilities.

dataset, representation, video, (13 more...)

arXiv.org Artificial Intelligence

2501.00599

Country: North America > Canada (0.04)

Genre: Research Report (0.64)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

S2 Chunking: A Hybrid Framework for Document Segmentation Through Integrated Spatial and Semantic Analysis

Verma, Prashant

arXiv.org Artificial IntelligenceJan-8-2025

Document chunking is a critical task in natural language processing (NLP) that involves dividing a document into meaningful segments. Traditional methods often rely solely on semantic analysis, ignoring the spatial layout of elements, which is crucial for understanding relationships in complex documents. This paper introduces a novel hybrid approach that combines layout structure, semantic analysis, and spatial relationships to enhance the cohesion and accuracy of document chunks. By leveraging bounding box information (bbox) and text embeddings, our method constructs a weighted graph representation of document elements, which is then clustered using spectral clustering. Experimental results demonstrate that this approach outperforms traditional methods, particularly in documents with diverse layouts such as reports, articles, and multi-column designs. The proposed method also ensures that no chunk exceeds a specified token length, making it suitable for use cases where token limits are critical (e.g., language models with input size limitations)

artificial intelligence, natural language, text processing, (17 more...)

arXiv.org Artificial Intelligence

2501.05485

Genre: Research Report > New Finding (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.52)

Add feedback

STContext: A Multifaceted Dataset for Developing Context-aware Spatio-temporal Crowd Mobility Prediction Models

Chen, Liyue, Fang, Jiangyi, Liu, Tengfei, Gao, Fangyuan, Wang, Leye

arXiv.org Artificial IntelligenceJan-7-2025

In smart cities, context-aware spatio-temporal crowd flow prediction (STCFP) models leverage contextual features (e.g., weather) to identify unusual crowd mobility patterns and enhance prediction accuracy. However, the best practice for incorporating contextual features remains unclear due to inconsistent usage of contextual features in different papers. Developing a multifaceted dataset with rich types of contextual features and STCFP scenarios is crucial for establishing a principled context modeling paradigm. Existing open crowd flow datasets lack an adequate range of contextual features, which poses an urgent requirement to build a multifaceted dataset to fill these research gaps. To this end, we create STContext, a multifaceted dataset for developing context-aware STCFP models. Specifically, STContext provides nine spatio-temporal datasets across five STCFP scenarios and includes ten contextual features, including weather, air quality index, holidays, points of interest, road networks, etc. Besides, we propose a unified workflow for incorporating contextual features into deep STCFP methods, with steps including feature transformation, dependency modeling, representation fusion, and training strategies. Through extensive experiments, we have obtained several useful guidelines for effective context modeling and insights for future research. The STContext is open-sourced at https://github.com/Liyue-Chen/STContext.

artificial intelligence, data mining, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2501.03583

Country:

North America > United States > New York > New York County > New York City (0.15)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > Alaska > Anchorage Municipality > Anchorage (0.04)
(15 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Transportation > Ground > Road (1.00)
Transportation > Passenger (0.93)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
(3 more...)

Add feedback

GNN-based Decentralized Perception in Multirobot Systems for Predicting Worker Actions

Imran, Ali, Beltrame, Giovanni, St-Onge, David

arXiv.org Artificial IntelligenceJan-7-2025

-- In industrial environments, predicting human actions is essential for ensuring safe and effective collaboration between humans and robots. This paper introduces a perception framework that enables mobile robots to understand and share information about human actions in a decentralized way. The framework first allows each robot to build a spatial graph representing its surroundings, which it then shares with other robots. This shared spatial data is combined with temporal information to track human behavior over time. A swarm-inspired decision-making process is used to ensure all robots agree on a unified interpretation of the human's actions. Results show that adding more robots and incorporating longer time sequences improve prediction accuracy. Collaborative robots are poised to become a cornerstone of Industry 5.0 [1], emphasizing human-centric design solutions to meet the flexibility demands of hyper-customized industrial processes [2]. Significant efforts have been directed toward identifying key enabling technologies to enhance robotic systems with advanced situational awareness and robust safety features for human coworkers. Two pivotal technologies stand out: individualized human-machine interaction systems that merge the strengths of humans and machines, and the application of AI to improve workplace safety [3].

information, prediction, robot, (16 more...)

arXiv.org Artificial Intelligence

2501.04193

Country:

North America > Canada (0.04)
Europe (0.04)

Genre: Research Report (0.70)

Industry: Transportation (0.47)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.66)

Add feedback