AITopics

2504.04739

Country:

Asia (0.68)
Europe > United Kingdom > England > Greater London > London (0.28)

Genre:

Overview (1.00)
Research Report > Experimental Study (0.93)
Research Report > New Finding (0.87)

Industry:

Health & Medicine > Public Health (1.00)
Health & Medicine > Consumer Health (1.00)
Health & Medicine > Health Care Providers & Services (0.93)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

arXiv.org Artificial IntelligenceAug-4-2025

TopoTTA: Topology-Enhanced Test-Time Adaptation for Tubular Structure Segmentation

Zhou, Jiale, Wang, Wenhan, Li, Shikun, Qu, Xiaolei, Guo, Xin, Liu, Yizhong, Tang, Wenzhong, Lin, Xun, Zheng, Yefeng

Tubular structure segmentation (TSS) is important for various applications, such as hemodynamic analysis and route navigation. Despite significant progress in TSS, domain shifts remain a major challenge, leading to performance degradation in unseen target domains. Unlike other segmentation tasks, TSS is more sensitive to domain shifts, as changes in topological structures can compromise segmentation integrity, and variations in local features distinguishing foreground from background (e.g., texture and contrast) may further disrupt topological continuity. To address these challenges, we propose Topology-enhanced Test-Time Adaptation (TopoTTA), the first test-time adaptation framework designed specifically for TSS. TopoTTA consists of two stages: Stage 1 adapts models to cross-domain topological discrepancies using the proposed Topological Meta Difference Convolutions (TopoMDCs), which enhance topological representation without altering pre-trained parameters; Stage 2 improves topological continuity by a novel Topology Hard sample Generation (TopoHG) strategy and prediction alignment on hard samples with pseudo-labels in the generated pseudo-break regions. Extensive experiments across four scenarios and ten datasets demonstrate TopoTTA's effectiveness in handling topological distribution shifts, achieving an average improvement of 31.81% in clDice. TopoTTA also serves as a plug-and-play TTA solution for CNN-based TSS models.

artificial intelligence, machine learning, proceedings, (13 more...)

2508.00442

Country: North America > Canada (0.28)

Genre: Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Therapeutic Area (0.87)
Health & Medicine > Diagnostic Medicine > Imaging (0.69)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.35)

arXiv.org Artificial IntelligenceAug-1-2025

Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents

Cai, Shaofei, Mu, Zhancun, Xia, Haiwen, Zhang, Bowei, Liu, Anji, Liang, Yitao

While Reinforcement Learning (RL) has achieved remarkable success in language modeling, its triumph hasn't yet fully translated to visuomotor agents. A primary challenge in RL models is their tendency to overfit specific tasks or environments, thereby hindering the acquisition of generalizable behaviors across diverse settings. This paper provides a preliminary answer to this challenge by demonstrating that RL-finetuned visuomotor agents in Minecraft can achieve zero-shot generalization to unseen worlds. Specifically, we explore RL's potential to enhance generalizable spatial reasoning and interaction capabilities in 3D worlds. To address challenges in multi-task RL representation, we analyze and establish cross-view goal specification as a unified multi-task goal space for visuomotor policies. Furthermore, to overcome the significant bottleneck of manual task design, we propose automated task synthesis within the highly customizable Minecraft environment for large-scale multi-task RL training, and we construct an efficient distributed RL framework to support this. Experimental results show RL significantly boosts interaction success rates by $4\times$ and enables zero-shot generalization of spatial reasoning across diverse environments, including real-world settings. Our findings underscore the immense potential of RL training in 3D simulated environments, especially those amenable to large-scale task generation, for significantly advancing visuomotor agents' spatial reasoning.

large language model, machine learning, reinforcement learning, (18 more...)

2507.23698

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Games (0.90)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

arXiv.org Artificial IntelligenceJul-31-2025

Efficient Spatial-Temporal Modeling for Real-Time Video Analysis: A Unified Framework for Action Recognition and Object Tracking

John, Shahla

Real-time video analysis remains a challenging problem in computer vision, requiring efficient processing of both spatial and temporal information while maintaining computational efficiency. Existing approaches often struggle to balance accuracy and speed, particularly in resource-constrained environments. In this work, we present a unified framework that leverages advanced spatial-temporal modeling techniques for simultaneous action recognition and object tracking. Our approach builds upon recent advances in parallel sequence modeling and introduces a novel hierarchical attention mechanism that adaptively focuses on relevant spatial regions across temporal sequences. We demonstrate that our method achieves state-of-the-art performance on standard benchmarks while maintaining real-time inference speeds. Extensive experiments on UCF-101, HMDB-51, and MOT17 datasets show improvements of 3.2% in action recognition accuracy and 2.8% in tracking precision compared to existing methods, with 40% faster inference time.

artificial intelligence, computer vision, machine learning, (14 more...)

2507.22421

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.78)

Li, Kunyang, Santiago, Jeffrey A Chan, Samarasinghe, Sarinda Dhanesh, Liu, Gaowen, Shah, Mubarak

GVD: Guiding Video Diffusion Model for Scalable Video Distillation

arXiv.org Artificial IntelligenceJul-31-2025

To address the larger computation and storage requirements associated with large video datasets, video dataset distillation aims to capture spatial and temporal information in a significantly smaller dataset, such that training on the distilled data has comparable performance to training on all of the data. We propose GVD: Guiding Video Diffusion, the first diffusion-based video distillation method. GVD jointly distills spatial and temporal features, ensuring high-fidelity video generation across diverse actions while capturing essential motion information. Our method's diverse yet representative distillations significantly outperform previous state-of-the-art approaches on the MiniUCF and HMDB51 datasets across 5, 10, and 20 Instances Per Class (IPC). Specifically, our method achieves 78.29 percent of the original dataset's performance using only 1.98 percent of the total number of frames in MiniUCF. Additionally, it reaches 73.83 percent of the performance with just 3.30 percent of the frames in HMDB51. Experimental results across benchmark video datasets demonstrate that GVD not only achieves state-of-the-art performance but can also generate higher resolution videos and higher IPC without significantly increasing computational cost.

artificial intelligence, machine learning, spatial reasoning, (17 more...)

2507.2236

Country: North America > United States (0.28)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.46)

Enhancing Spatial Reasoning through Visual and Textual Thinking

Liang, Xun, Guo, Xin, Jin, Zhongming, Pan, Weihang, Shang, Penghui, Cai, Deng, Lin, Binbin, Ye, Jieping

The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly in recent years, they are still struggling with the spatial reasoning task. In this paper, we introduce a method that can enhance Spatial reasoning through Visual and Textual thinking Simultaneously (SpatialVTS). In the spatial visual thinking phase, our model is trained to generate location-related specific tokens of essential targets automatically. Not only are the objects mentioned in the problem addressed, but also the potential objects related to the reasoning are considered. During the spatial textual thinking phase, Our model conducts long-term thinking based on visual cues and dialogues, gradually inferring the answers to spatial reasoning problems. To effectively support the model's training, we perform manual corrections to the existing spatial reasoning dataset, eliminating numerous incorrect labels resulting from automatic annotation, restructuring the data input format to enhance generalization ability, and developing thinking processes with logical reasoning details. Without introducing additional information (such as masks or depth), our model's overall average level in several spatial understanding tasks has significantly improved compared with other models.

artificial intelligence, machine learning, natural language, (16 more...)

2507.20529

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Nobin, Pritom Ray, Rifat, Imran Ahammad

STARN-GAT: A Multi-Modal Spatio-Temporal Graph Attention Network for Accident Severity Prediction

Accurate prediction of traffic accident severity is critical for improving road safety, optimizing emergency response strategies, and informing the design of safer transportation infrastructure. However, existing approaches often struggle to effectively model the intricate interdependencies among spatial, temporal, and contextual variables that govern accident outcomes. In this study, we introduce STARN-GAT, a Multi-Modal Spatio-Temporal Graph Attention Network, which leverages adaptive graph construction and modality-aware attention mechanisms to capture these complex relationships. Unlike conventional methods, STARN-GAT integrates road network topology, temporal traffic patterns, and environmental context within a unified attention-based framework. The model is evaluated on the Fatality Analysis Reporting System (FARS) dataset, achieving a Macro F1-score of 85 percent, ROC-AUC of 0.91, and recall of 81 percent for severe incidents. To ensure generalizability within the South Asian context, STARN-GAT is further validated on the ARI-BUET traffic accident dataset, where it attains a Macro F1-score of 0.84, recall of 0.78, and ROC-AUC of 0.89. These results demonstrate the model's effectiveness in identifying high-risk cases and its potential for deployment in real-time, safety-critical traffic management systems. Furthermore, the attention-based architecture enhances interpretability, offering insights into contributing factors and supporting trust in AI-assisted decision-making. Overall, STARN-GAT bridges the gap between advanced graph neural network techniques and practical applications in road safety analytics.

data mining, machine learning, prediction, (16 more...)

2507.20451

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report > New Finding (1.00)

Industry: Transportation > Ground > Road (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(2 more...)

Digital Twin Channel-Enabled Online Resource Allocation for 6G: Principle, Architecture and Application

Li, Tongjie, Zhang, Jianhua, Yu, Li, Zhang, Yuxiang, Cai, Yunlong, Xu, Fan, Liu, Guangyi

The emergence of sixth-generation (6G) networks is reshaping wireless communications to support mission-critical applications such as the Industrial Internet of Things (IIoT), autonomous driving, and smart manufacturing. Compared with 5G, 6G imposes significantly more stringent requirements on latency, reliability, adaptability, and end-to-end responsiveness [1, 2]. IIoT scenarios are particularly challenging due to the coexistence of complex radio propagation conditions and diverse service requirements. Dense deployments, metallic scatterers, and dynamic obstacles give rise to severe multipath fading, especially in high-frequency bands such as mmWave and terahertz, where signal stability is highly sensitive to physical structures [3, 4]. In parallel, service demands span multiple categories, such as periodic sensing, closed-loop control, event-triggered communication, and edge computing, each with distinct quality-of-service (QoS) requirements [5]. To address these multifaceted challenges, resource allocation mechanisms must be environment-aware, latency-sensitive, and capable of online adaptation across large-scale, dynamic deployments. Artificial intelligence (AI)-driven resource allocation has attracted growing interest due to its ability to learn underlying correlations from sensing data and historical records.

artificial intelligence, communication, machine learning, (15 more...)

2507.19974

Country: Asia > China (0.72)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.47)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.47)

NAICS-Aware Graph Neural Networks for Large-Scale POI Co-visitation Prediction: A Multi-Modal Dataset and Methodology

Alrubyli, Yazeed, Alomeir, Omar, Wafa, Abrar, Hidvégi, Diána, Alrasheed, Hend, Bahrami, Mohsen

Understanding where people go after visiting one business is crucial for urban planning, retail analytics, and location-based services. However, predicting these co-visitation patterns across millions of venues remains challenging due to extreme data sparsity and the complex interplay between spatial proximity and business relationships. Traditional approaches using only geographic distance fail to capture why coffee shops attract different customer flows than fine dining restaurants, even when co-located. We introduce NAICS-aware GraphSAGE, a novel graph neural network that integrates business taxonomy knowledge through learnable embeddings to predict population-scale co-visitation patterns. Our key insight is that business semantics, captured through detailed industry codes, provide crucial signals that pure spatial models cannot explain. The approach scales to massive datasets (4.2 billion potential venue pairs) through efficient state-wise decomposition while combining spatial, temporal, and socioeconomic features in an end-to-end framework. Evaluated on our POI-Graph dataset comprising 94.9 million co-visitation records across 92,486 brands and 48 US states, our method achieves significant improvements over state-of-the-art baselines: the R-squared value increases from 0.243 to 0.625 (a 157 percent improvement), with strong gains in ranking quality (32 percent improvement in NDCG at 10).

data mining, machine learning, prediction, (15 more...)

2507.19697

Country: North America > United States > Massachusetts (0.28)

Genre: Research Report > Experimental Study (0.95)

Industry:

Banking & Finance > Economy (0.93)
Consumer Products & Services > Restaurants (0.86)
Consumer Products & Services > Food, Beverage, Tobacco & Cannabis > Beverages (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Li, Haichuan, Westerlund, Tomi

Co-Win: Joint Object Detection and Instance Segmentation in LiDAR Point Clouds via Collaborative Window Processing

Accurate perception and scene understanding in complex urban environments is a critical challenge for ensuring safe and efficient autonomous navigation. In this paper, we present Co-Win, a novel bird's eye view (BEV) perception framework that integrates point cloud encoding with efficient parallel window-based feature extraction to address the multi-modality inherent in environmental understanding. Our method employs a hierarchical architecture comprising a specialized encoder, a window-based backbone, and a query-based decoder head to effectively capture diverse spatial features and object relationships. Unlike prior approaches that treat perception as a simple regression task, our framework incorporates a variational approach with mask-based instance segmentation, enabling fine-grained scene decomposition and understanding. The Co-Win architecture processes point cloud data through progressive feature extraction stages, ensuring that predicted masks are both data-consistent and contextually relevant. Furthermore, our method produces interpretable and diverse instance predictions, enabling enhanced downstream decision-making and planning in autonomous driving systems.

artificial intelligence, point cloud, spatial reasoning, (13 more...)

2507.19691

Country: Europe > Finland (0.15)

Genre: Research Report (0.50)

Industry:

Transportation > Ground > Road (0.35)
Information Technology (0.35)
Automobiles & Trucks (0.35)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.48)