AITopics | dense prediction task

We introduce point affiliation into feature upsampling, a notion that describes the affiliation of each upsampled point to a semantic cluster formed by local decoder feature points with semantic similarity. By rethinking point affiliation, we present a generic formulation for generating upsampling kernels. The kernels encourage not only semantic smoothness but also boundary sharpness in the upsampled feature maps. Such properties are particularly useful for some dense prediction tasks such as semantic segmentation. The key idea of our formulation is to generate similarity-aware kernels by comparing the similarity between each encoder feature point and the spatially associated local region of decoder features.

artificial intelligence, kernel, machine learning, (17 more...)

Neural Information Processing Systems

Country:

Asia > China (0.05)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

4e0928de075538c593fbdabb0c5ef2c3-Paper.pdf

Neural Information Processing Systems | Feb-8-2026, 14:35:18 GMT

arxiv preprint arxiv, transformer, vision transformer, (12 more...)

Neural Information Processing Systems

Country: Oceania > Australia > South Australia > Adelaide (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

15212bd2265c4a3ab0dbc1b1982c1b69-Paper-Conference.pdf

Neural Information Processing Systems | Feb-8-2026, 04:36:48 GMT

artificial intelligence, machine learning, natural language, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > Colorado (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Industry: Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

3000311ca56a1cb93397bc676c0b7fff-Paper.pdf

Neural Information Processing Systems | Feb-7-2026, 23:45:07 GMT

learning, pixel, representation, (16 more...)

Neural Information Processing Systems

Country:

North America > Canada > Quebec > Montreal (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report (0.66)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.93)
(2 more...)

AiluRus: A Scalable ViT Framework for Dense Prediction

Neural Information Processing Systems | Dec-25-2025, 16:36:59 GMT

Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. However, their complexity increases dramatically when handling long token sequences, particularly for dense prediction tasks that require high-resolution input. Notably, dense prediction tasks, such as semantic segmentation or object detection, place more emphasis on the contours or shapes of objects, while the texture inside objects is less informative. Motivated by this observation, we propose to apply adaptive resolution for different regions in the image according to their importance. Specifically, at the intermediate layer of the ViT, we select anchors from the token sequence using the proposed spatial-aware density-based clustering algorithm. Tokens that are adjacent to anchors are merged to form low-resolution regions, while others are preserved independently as high-resolution.
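The anchor selection and merging steps described in this abstract can be illustrated with a rough sketch. It is not the AiluRus algorithm: it assumes a generic density-peaks style score (local density times distance to the nearest denser token), and the function names, the `cutoff`, `spatial_weight`, and `merge_radius` parameters, and the averaging rule are all hypothetical choices made for illustration.

```python
import math

def select_anchors(tokens, positions, num_anchors, cutoff=1.0, spatial_weight=0.5):
    """Pick anchor indices with a density-peaks style score (assumed, not the paper's)."""
    n = len(tokens)

    def dist(a, b):
        # Assumed "spatial-aware" distance: feature distance plus weighted grid distance.
        feature = math.dist(tokens[a], tokens[b])
        spatial = math.dist(positions[a], positions[b])
        return feature + spatial_weight * spatial

    d = [[dist(a, b) for b in range(n)] for a in range(n)]
    # Local density: Gaussian-weighted count of nearby tokens.
    density = [sum(math.exp(-(d[a][b] / cutoff) ** 2) for b in range(n) if b != a)
               for a in range(n)]
    scores = []
    for a in range(n):
        # Distance to the nearest token of higher density (max distance for the densest).
        higher = [d[a][b] for b in range(n) if density[b] > density[a]]
        separation = min(higher) if higher else max(d[a])
        scores.append(density[a] * separation)
    return sorted(range(n), key=lambda a: scores[a], reverse=True)[:num_anchors]

def merge_tokens(tokens, positions, anchors, merge_radius=1.5):
    """Average tokens spatially close to an anchor; keep all other tokens unchanged."""
    groups = {a: [a] for a in anchors}
    kept = []
    for t in range(len(tokens)):
        if t in groups:
            continue
        nearest = min(anchors, key=lambda a: math.dist(positions[t], positions[a]))
        if math.dist(positions[t], positions[nearest]) <= merge_radius:
            groups[nearest].append(t)   # becomes part of a low-resolution region
        else:
            kept.append(t)              # preserved as a high-resolution token
    channels = len(tokens[0])
    merged = [[sum(tokens[t][c] for t in members) / len(members) for c in range(channels)]
              for members in groups.values()]
    return merged, [tokens[t] for t in kept]
```

With three similar neighbouring tokens and one distant outlier, this sketch would merge the three into a single averaged token and keep the outlier on its own, shortening the sequence from four tokens to two.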

dense prediction task, name change, scalable vit framework, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.81)

SAPA: Similarity-Aware Point Affiliation for Feature Upsampling

Neural Information Processing Systems | Dec-24-2025, 15:42:57 GMT

We introduce point affiliation into feature upsampling, a notion that describes the affiliation of each upsampled point to a semantic cluster formed by local decoder feature points with semantic similarity. By rethinking point affiliation, we present a generic formulation for generating upsampling kernels. The kernels encourage not only semantic smoothness but also boundary sharpness in the upsampled feature maps. Such properties are particularly useful for some dense prediction tasks such as semantic segmentation. The key idea of our formulation is to generate similarity-aware kernels by comparing the similarity between each encoder feature point and the spatially associated local region of decoder features. In this way, the encoder feature point can function as a cue to inform the semantic cluster of upsampled feature points. To embody the formulation, we further instantiate a lightweight upsampling operator, termed Similarity-Aware Point Affiliation (SAPA), and investigate its variants. SAPA invites consistent performance improvements on a number of dense prediction tasks, including semantic segmentation, object detection, depth estimation, and image matting. Code is available at: https://github.com/poppinace/sapa
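The kernel-generation idea in this abstract can be sketched in a few lines. This is not the code from the linked repository: it assumes a plain dot product as the similarity, softmax normalization, encoder and decoder features with the same channel count, and border clamping for the local window. The function name and the `scale` and `radius` parameters are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def similarity_aware_upsample(encoder, decoder, scale=2, radius=1):
    """Upsample `decoder` (h x w grid of vectors) guided by `encoder` (h*scale x w*scale).

    Each output point is a weighted sum of the decoder features in a local window;
    the weights come from the similarity between the encoder feature at that output
    position and each decoder feature in the window.
    """
    h, w = len(decoder), len(decoder[0])
    channels = len(decoder[0][0])
    out = []
    for i in range(h * scale):
        row = []
        for j in range(w * scale):
            ci, cj = i // scale, j // scale  # spatially associated decoder location
            # Local decoder window, clamped at the borders (a simplification:
            # clamped positions repeat border features).
            window = [decoder[min(max(ci + di, 0), h - 1)][min(max(cj + dj, 0), w - 1)]
                      for di in range(-radius, radius + 1)
                      for dj in range(-radius, radius + 1)]
            # Similarity-aware kernel: one normalized weight per window position.
            kernel = softmax([dot(encoder[i][j], feat) for feat in window])
            row.append([sum(k * feat[c] for k, feat in zip(kernel, window))
                        for c in range(channels)])
        out.append(row)
    return out
```

In this sketch, an upsampled point whose encoder feature resembles one side of a boundary draws almost all of its weight from decoder features on that side, which is one way to read the "boundary sharpness" property mentioned in the abstract.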

feature point, name change, similarity-aware point affiliation, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.64)
Information Technology > Artificial Intelligence > Vision (0.60)

MST: Masked Self-Supervised Transformer for Visual Representation

Neural Information Processing Systems | Dec-24-2025, 06:33:23 GMT

Transformers have been widely used for self-supervised pre-training in Natural Language Processing (NLP) and have achieved great success. However, they have not been fully explored in visual self-supervised learning. Meanwhile, previous methods consider only high-level features and learn representations from a global perspective, which may fail to transfer to downstream dense prediction tasks that focus on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image while preserving the global semantic information. Specifically, inspired by Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure for self-supervised learning. More importantly, the masked tokens together with the remaining tokens are further recovered by a global image decoder, which preserves the spatial information of the image and is more friendly to downstream dense prediction tasks. Experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method. For instance, MST achieves a Top-1 accuracy of 76.9% with DeiT-S using only 300-epoch pre-training under linear evaluation, which outperforms supervised methods with the same number of epochs by 0.4% and its comparable variant DINO by 1.0%. For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO object detection and 74.04% mIoU on Cityscapes segmentation with only 100-epoch pre-training.
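One way to picture an attention-based masking strategy like the one this abstract describes is sketched below. It is an assumption-laden illustration rather than MST's procedure: it protects the tokens with the highest attention scores and masks the remaining tokens at random. The function name and the `mask_prob`, `protect_ratio`, and `seed` parameters are hypothetical.

```python
import math
import random

def attention_guided_mask(attention, mask_prob=0.3, protect_ratio=0.25, seed=None):
    """Return one boolean per token: True means the token is masked.

    `attention` holds one score per patch token (for example, attention from a
    class token averaged over heads; that choice is an assumption of this sketch).
    """
    rng = random.Random(seed)
    n = len(attention)
    num_protected = math.ceil(n * protect_ratio)
    # Tokens with the highest attention are treated as crucial structure and never masked.
    ranked = sorted(range(n), key=lambda i: attention[i], reverse=True)
    protected = set(ranked[:num_protected])
    # Every other token is masked independently with probability `mask_prob`.
    return [i not in protected and rng.random() < mask_prob for i in range(n)]
```

Calling `attention_guided_mask(scores, mask_prob=0.3, seed=0)` on a list of per-patch scores would leave the top quarter of tokens untouched and mask roughly 30% of the rest.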

masked self-supervised transformer, name change, visual representation, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

FLSL: Feature-level Self-supervised Learning

Neural Information Processing Systems | Dec-24-2025, 00:12:38 GMT

Current self-supervised learning (SSL) methods (e.g., SimCLR, DINO, VICReg, MOCOv3) primarily target instance-level representations and do not generalize well to dense prediction tasks, such as object detection and segmentation. Towards aligning SSL with dense predictions, this paper demonstrates for the first time the underlying mean-shift clustering process of Vision Transformers (ViT), which aligns well with natural image semantics (e.g., a world of objects and stuff). By employing a transformer for joint embedding and clustering, we propose a bi-level feature clustering SSL method, coined Feature-Level Self-supervised Learning (FLSL). We present the formal definition of the FLSL problem and construct the objectives from the mean-shift and k-means perspectives. We show that FLSL promotes remarkable semantic cluster representations and learns an embedding scheme amenable to intra-view and inter-view feature clustering. Experiments show that FLSL yields significant improvements in dense prediction tasks, achieving 44.9 (+2.8)% AP and 46.5% AP in object detection, as well as 40.8 (+2.3)%

feature-level self-supervised learning, flsl, self-supervised learning, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)
