AITopics | global context

Multi-Kernel Correlation-Attention Vision Transformer for Enhanced Contextual Understanding and Multi-Scale Integration

Neural Information Processing SystemsJun-17-2026, 22:48:31 GMT

Significant progress has been achieved using Vision Transformers (ViTs) in computer vision. However, challenges persist in modeling multi-scale spatial relationships, hindering effective integration of fine-grained local details and longrange global dependencies. To address this limitation, a Multi-Kernel CorrelationAttention Vision Transformer (MK-CAViT) grounded in the Hirschfeld-GebeleinRényi (HGR) theory was proposed, introducing three key innovations. A parallel multi-kernel architecture was utilized to extract multi-scale features through small, medium, and large kernels, overcoming the single-scale constraints of conventional ViTs. The cross-scale interactions were enhanced through the Fast-HGR attention mechanism, which models nonlinear dependencies and applies adaptive scaling to weigh connections and refine contextual reasoning. Additionally, a stable multi-scale fusion strategy was adopted, integrating dynamic normalization and staged learning to mitigate gradient variance, progressively fusing local and global contexts, and improving training stability.

artificial intelligence, dependency, machine learning, (18 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Diagnostic Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

MR. Video: MapReduce as an Effective Principle for Long Video Understanding

Neural Information Processing SystemsJun-10-2026, 01:39:30 GMT

The fundamental challenge of long video understanding, e.g., question answering, lies in the extensive number of frames, making it infeasible to densely understand the local details while comprehensively digest the global contexts, especially within a limited context length. To address this problem, our insight is to process short video segments individually and combine these segment-level analyses into a final response. This intuition is noted in the well-established MapReduce principle in big data processing and is naturally compatible with inference scaling at the system level. Motivated by this, we propose MR. Video (pronounced as mister video), a long video understanding framework adopting the MapReduce principle. We define the standard operations of MapReduce in a long video understanding context: the Map steps conduct independent and sequence-parallel dense perception on short video segments, covering local details, while the Reduce steps comprehensively aggregate the segment-level results into an answer with global contexts.

artificial intelligence, proceedings, video agent, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (0.88)

Add feedback

1cdf14d1e3699d61d237cf76ce1c2dca-Supplemental.pdf

Neural Information Processing SystemsApr-24-2026, 23:51:05 GMT

We follow [21] and implement our image compression models as "VQGANs". More specifically, we use the official implementation provided at https://github.com/CompVis/ For FFHQ, we train such a compression model from scratch. See Tab. 4 for an overview. As some of the codebook entries remain unused after training, we shrink the codebook to its effective size when training a generative model on top of it.

artificial intelligence, dataset, machine learning, (18 more...)

Neural Information Processing Systems

Technology: