AITopics

Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.

large language model, machine learning, natural language, (20 more...)

2511.21688

Country: North America (0.28)

Genre: Research Report > New Finding (0.88)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.73)

Qwen3-VL Technical Report

Bai, Shuai, Cai, Yuxuan, Chen, Ruizhe, Chen, Keqin, Chen, Xionghui, Cheng, Zesen, Deng, Lianghao, Ding, Wei, Gao, Chang, Ge, Chunjiang, Ge, Wenbin, Guo, Zhifang, Huang, Qidong, Huang, Jie, Huang, Fei, Hui, Binyuan, Jiang, Shutong, Li, Zhaohai, Li, Mingsheng, Li, Mei, Li, Kaixin, Lin, Zicheng, Lin, Junyang, Liu, Xuejing, Liu, Jiawei, Liu, Chenglong, Liu, Yang, Liu, Dayiheng, Liu, Shixuan, Lu, Dunjie, Luo, Ruilin, Lv, Chenxu, Men, Rui, Meng, Lingchen, Ren, Xuancheng, Ren, Xingzhang, Song, Sibo, Sun, Yuchong, Tang, Jun, Tu, Jianhong, Wan, Jianqiang, Wang, Peng, Wang, Pengfei, Wang, Qiuyue, Wang, Yuxuan, Xie, Tianbao, Xu, Yiheng, Xu, Haiyang, Xu, Jin, Yang, Zhibo, Yang, Mingkun, Yang, Jianxin, Yang, An, Yu, Bowen, Zhang, Fei, Zhang, Hang, Zhang, Xi, Zheng, Bo, Zhong, Humen, Zhou, Jingren, Zhou, Fan, Zhou, Jing, Zhu, Yuanzhi, Zhu, Ke

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

benchmark, large language model, machine learning, (19 more...)

2511.21631

Genre: Research Report (0.82)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.92)
(3 more...)

$\mathcal{E}_0$: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion

Zhan, Zhihao, Zhou, Jiaying, Zhang, Likui, Lv, Qinhan, Liu, Hao, Zhang, Jusheng, Li, Weizheng, Chen, Ziliang, Chen, Tianshui, Wang, Keze, Lin, Liang, Wang, Guangrun

Vision-Language-Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. Yet existing VLA models still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We introduce E0, a continuized discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. Compared with continuous diffusion policies, E0 offers two key advantages: (1) discrete action tokens align naturally with the symbolic structure of pretrained VLM/VLA backbones, enabling stronger semantic conditioning; and 2. discrete diffusion matches the true quantized nature of real-world robot control-whose hardware constraints (e.g., encoder resolution, control frequency, actuation latency) inherently discretize continuous signals-and therefore benefits from a Bayes-optimal denoiser that models the correct discrete action distribution, leading to stronger generalization. Compared with discrete autoregressive and mask-based discrete diffusion models, E0 supports a significantly larger and finer-grained action vocabulary and avoids the distributional mismatch introduced by masking-based corruptions-yielding more accurate fine-grained action control. We further introduce a spherical viewpoint perturbation augmentation method to improve robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, and ManiSkill show that E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average. Real-world evaluation on a Franka arm confirms that E0 delivers precise, robust, and transferable manipulation, establishing discrete diffusion as a promising direction for generalizable VLA policy learning.

artificial intelligence, arxiv preprint arxiv, machine learning, (17 more...)

2511.21542

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.93)

Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory

Wu, Yuqi, Zheng, Wenzhao, Zhou, Jie, Lu, Jiwen

Dense 3D scene reconstruction from an ordered sequence or unordered image collections is a critical step when bringing research in computer vision into practical scenarios. Following the paradigm introduced by DUSt3R, which unifies an image pair densely into a shared coordinate system, subsequent methods maintain an implicit memory to achieve dense 3D reconstruction from more images. However, such implicit memory is limited in capacity and may suffer from information loss of earlier frames. We propose Point3R, an online framework targeting dense streaming 3D reconstruction. To be specific, we maintain an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Each pointer in this memory is assigned a specific 3D position and aggregates scene information nearby in the global coordinate system into a changing spatial feature. Information extracted from the latest frame interacts explicitly with this pointer memory, enabling dense integration of the current observation into the global coordinate system. We design a 3D hierarchical position embedding to promote this interaction and design a simple yet effective fusion mechanism to ensure that our pointer memory is uniform and efficient. Our method achieves competitive or state-of-the-art performance on various tasks with low training costs.

artificial intelligence, machine learning, spatial reasoning, (18 more...)

2507.02863

Country: Asia (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.35)

arXiv.org Artificial IntelligenceNov-27-2025

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Xu, Peiran, Wang, Sudong, Zhu, Yao, Li, Jianing, Zhang, Yunjian

Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric, which fails to capture the hierarchical structure and interdependence of spatial abilities. T o address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. T o provide a unified evaluation across heterogeneous tasks, we further introduce a high-level capability-oriented metric that reliably assesses a model's overall spatial reasoning ability. Extensive experiments over massive MLLMs reveal distinct performance stratification across cognitive levels: models exhibit strong perceptual grounding yet remain limited in symbolic reasoning, causal inference, and planning. Additional human tests demonstrate that humans perform selective, goal-directed abstraction, while MLLMs tend to over-attend to surface details without coherent spatial intent. Our work establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying the foundation for future spatially intelligent systems.

large language model, machine learning, natural language, (12 more...)

2511.21471

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Ghosh, Kriti, Chakraborty, Devjyoti, Ramaswamy, Lakshmish, Bhandarkar, Suchendra M., Kim, In Kee, O'Hare, Nancy, Mishra, Deepak

$Δ$-NeRF: Incremental Refinement of Neural Radiance Fields through Residual Control and Knowledge Transfer

arXiv.org Artificial IntelligenceNov-27-2025

Neural Radiance Fields (NeRFs) have demonstrated remarkable capabilities in 3D reconstruction and novel view synthesis. However, most existing NeRF frameworks require complete retraining when new views are introduced incrementally, limiting their applicability in domains where data arrives sequentially. This limitation is particularly problematic in satellite-based terrain analysis, where regions are repeatedly observed over time. Incremental refinement of NeRFs remains underexplored, and naive approaches suffer from catastrophic forgetting when past data is unavailable. We propose $Δ$-NeRF, a unique modular residual framework for incremental NeRF refinement. $Δ$-NeRF introduces several novel techniques including: (1) a residual controller that injects per-layer corrections into a frozen base NeRF, enabling refinement without access to past data; (2) an uncertainty-aware gating mechanism that prevents overcorrection by adaptively combining base and refined predictions; and (3) a view selection strategy that reduces training data by up to 47\% while maintaining performance. Additionally, we employ knowledge distillation to compress the enhanced model into a compact student network (20\% of original size). Experiments on satellite imagery demonstrate that $Δ$-NeRF achieves performance comparable to joint training while reducing training time by 30-42\%. $Δ$-NeRF consistently outperforms existing baselines, achieving an improvement of up to 43.5\% in PSNR over naive fine-tuning and surpassing joint training on some metrics.

artificial intelligence, controller, machine learning, (16 more...)

2511.20804

Country: North America > United States (0.68)

Genre: Research Report (1.00)

Industry: Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.35)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.34)

Denoising Refinement Diffusion Models for Simultaneous Generation of Multi-scale Mobile Network Traffic

Qi, Xiaoqian, Chai, Haoye, Liu, Sichang, Yue, Lei, Pan, Raoyuan, Wang, Yue, Li, Yong

The planning, management, and resource scheduling of cellular mobile networks require joint estimation of mobile traffic across different layers and nodes. Mobile traffic generation can proactively anticipate user demands and capture the dynamics of network load. However, existing methods mainly focus on generating traffic at a single spatiotemporal resolution, making it difficult to jointly model multi-scale traffic patterns. In this paper, we propose ZoomDiff, a diffusion-based model for multi-scale mobile traffic generation. ZoomDiff maps urban environmental context into mobile traffic with multiple spatial and temporal resolutions through a set of customized Denoising Refinement Diffusion Models (DRDM). DRDM employs a multi-stage noise-adding and denoising mechanism, enabling different stages to generate traffic at distinct spatiotemporal resolutions. This design aligns the progressive denoising process with hierarchical network layers, including base stations, cells, and grids of varying granularities. Experiments on real-world mobile traffic datasets show that ZoomDiff achieves at least an 18.4% improvement over state-of-the-art baselines in multi-scale traffic generation tasks. Moreover, ZoomDiff demonstrates strong efficiency and cross-city generalization, highlighting its potential as a powerful generative framework for modeling multi-scale mobile network dynamics.

artificial intelligence, machine learning, traffic, (19 more...)

2511.17532

Country:

Asia > China (0.48)
North America > United States (0.30)

Genre: Research Report > New Finding (0.46)

Industry:

Telecommunications (1.00)
Information Technology > Networks (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Shelukhan, Matvei, Mamedov, Timur, Kvanchiani, Karina

StableTrack: Stabilizing Multi-Object Tracking on Low-Frequency Detections

Multi-object tracking (MOT) is one of the most challenging tasks in computer vision, where it is important to correctly detect objects and associate these detections across frames. Current approaches mainly focus on tracking objects in each frame of a video stream, making it almost impossible to run the model under conditions of limited computing resources. T o address this issue, we propose Stable-Track, a novel approach that stabilizes the quality of tracking on low-frequency detections. Our method introduces a new two-stage matching strategy to improve the cross-frame association between low-frequency detections. W e propose a novel Bbox-Based Distance instead of the conventional Mahalanobis distance, which allows us to effectively match objects using the Re-ID model. Furthermore, we integrate visual tracking into the Kalman Filter and the overall tracking pipeline. Our method outperforms current state-of-the-art trackers in the case of low-frequency detections, achieving 11.6% HOTA improvement at 1 Hz on MOT17-val, while keeping up with the best approaches on the standard MOT17, MOT20, and DanceTrack benchmarks with full-frequency detections.

artificial intelligence, detection, machine learning, (18 more...)

2511.20418

Country: Europe > Russia (0.28)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.46)

MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization

Huang, Chengyue, Zhang, Mellon M., Azarcon, Robert, Chou, Glen, Kira, Zsolt

Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naive fine-tuning often disrupts these representations and harms generalization. Existing fixes -- freezing modules or applying uniform regularization -- either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS introduces no additional parameters or data, and can be seamlessly integrated into existing VLAs. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and challenging benchmarks such as SimplerEnv, CALVIN, LIBERO, as well as real-world evaluations on the Franka Emika Panda platform, MAPS consistently boosts both in-distribution and out-of-distribution performance (up to +30%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer.

large language model, machine learning, natural language, (19 more...)

2511.19878

Country: North America > United States (0.67)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.87)
(2 more...)

Dual-branch Spatial-Temporal Self-supervised Representation for Enhanced Road Network Learning

Guo, Qinghong, Wang, Yu, Cao, Ji, Zheng, Tongya, Dai, Junshu, Hu, Bingde, Liu, Shunyu, Jin, Canghong

Road network representation learning (RNRL) has attracted increasing attention from both researchers and practitioners as various spatiotemporal tasks are emerging. Recent advanced methods leverage Graph Neural Networks (GNNs) and contrastive learning to characterize the spatial structure of road segments in a self-supervised paradigm. However, spatial heterogeneity and temporal dynamics of road networks raise severe challenges to the neighborhood smoothing mechanism of self-supervised GNNs. To address these issues, we propose a $\textbf{D}$ual-branch $\textbf{S}$patial-$\textbf{T}$emporal self-supervised representation framework for enhanced road representations, termed as DST. On one hand, DST designs a mix-hop transition matrix for graph convolution to incorporate dynamic relations of roads from trajectories. Besides, DST contrasts road representations of the vanilla road network against that of the hypergraph in a spatial self-supervised way. The hypergraph is newly built based on three types of hyperedges to capture long-range relations. On the other hand, DST performs next token prediction as the temporal self-supervised task on the sequences of traffic dynamics based on a causal Transformer, which is further regularized by differentiating traffic modes of weekdays from those of weekends. Extensive experiments against state-of-the-art methods verify the superiority of our proposed framework. Moreover, the comprehensive spatiotemporal modeling facilitates DST to excel in zero-shot learning scenarios.

artificial intelligence, machine learning, natural language, (20 more...)

2511.06633

Country: Asia > China (0.96)

Genre: Research Report (1.00)

Industry:

Transportation > Infrastructure & Services (1.00)
Transportation > Ground > Road (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)