AiluRus: A Scalable ViT Framework for Dense Prediction

Dec-25-2025, 16:36:59 GMT – Neural Information Processing Systems

Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. However, their computational complexity increases dramatically when handling long token sequences, particularly for dense prediction tasks that require high-resolution input. Notably, dense prediction tasks, such as semantic segmentation or object detection, place more emphasis on the contours or shapes of objects, while the texture inside objects is less informative. Motivated by this observation, we propose to apply adaptive resolution to different regions of the image according to their importance. Specifically, at an intermediate layer of the ViT, we select anchors from the token sequence using the proposed spatial-aware density-based clustering algorithm. Tokens that are adjacent to anchors are merged to form low-resolution regions, while the remaining tokens are preserved individually as high-resolution regions.
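The anchor selection and merging step described above can be illustrated with a small NumPy sketch. The abstract does not spell out the clustering details, so everything specific here is an assumption made for illustration: a density-peaks style score (local density times distance to the nearest denser token), a distance that adds grid distance to feature distance as the "spatial-aware" part, a Chebyshev `radius` to decide which tokens count as adjacent to an anchor, and mean pooling as the merge rule. The function name `merge_tokens_by_anchors` and all of its parameters are hypothetical.

```python
import numpy as np

def merge_tokens_by_anchors(tokens, grid_hw, num_anchors, k=4,
                            spatial_weight=1.0, radius=1):
    """Illustrative sketch only; not the paper's exact algorithm.

    tokens: (N, C) array of token features laid out row-major on an H x W grid.
    Returns (merged_tokens, group_ids, anchor_indices).
    """
    h, w = grid_hw
    n, c = tokens.shape
    assert n == h * w, "token count must match the grid size"

    # Grid coordinates (row, col) of every token.
    ys, xs = np.divmod(np.arange(n), w)
    coords = np.stack([ys, xs], axis=1).astype(np.float64)

    # Assumed spatial-aware distance: feature distance plus weighted grid distance.
    feat_d = np.linalg.norm(tokens[:, None, :] - tokens[None, :, :], axis=-1)
    spat_d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    dist = feat_d + spatial_weight * spat_d

    # Local density from the k nearest neighbours (column 0 is the token itself).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    density = np.exp(-knn.mean(axis=1))

    # Distance to the nearest token with strictly higher density.
    higher = density[None, :] > density[:, None]
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta[np.isinf(delta)] = dist.max()  # densest token(s) have no denser neighbour

    # Anchors are the tokens with the largest density * delta score.
    anchors = np.argsort(-(density * delta))[:num_anchors]

    # Assumed adjacency rule: within `radius` grid cells (Chebyshev) of an anchor.
    cheb = np.abs(coords[:, None, :] - coords[anchors][None, :, :]).max(axis=-1)
    near = cheb <= radius
    nearest = np.where(near, dist[:, anchors], np.inf).argmin(axis=1)
    is_merged = near.any(axis=1)

    # Merged tokens share their anchor's group; the rest keep a group of their own.
    group = np.empty(n, dtype=np.int64)
    group[is_merged] = nearest[is_merged]
    group[anchors] = np.arange(len(anchors))  # each anchor leads its own group
    num_kept = int((~is_merged).sum())
    group[~is_merged] = len(anchors) + np.arange(num_kept)

    # Mean-pool each group into a single output token.
    num_groups = len(anchors) + num_kept
    sums = np.zeros((num_groups, c))
    np.add.at(sums, group, tokens)
    counts = np.bincount(group, minlength=num_groups)
    return sums / counts[:, None], group, anchors
```

Under these assumptions, the output sequence holds one token per anchor region plus every unmerged token, so later layers would process a shorter sequence. The `group` array records which original token fed each output token, which is the information a dense prediction head would need to map results back to the full grid.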

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Machine Learning > Statistical Learning
    - Clustering (0.81)