Elastic ViTs from Pretrained Models without Retraining
Neural Information Processing Systems
Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structure correlations, which are approximated via an evolutionary algorithm. It does not require labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeiT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to any computational budget. Our key contributions include an efficient pruning strategy for pretrained Vision Transformers, a novel evolutionary approximation of Hessian off-diagonal structures, and a self-supervised importance scoring mechanism that maintains strong performance without requiring retraining or labels. Code and pruned models are available at: https://elastic.ashita.nl/
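To illustrate how a single importance ranking can serve a continuum of compute budgets, the following is a minimal sketch, not the SnapViT implementation. It assumes a plain first-order Taylor score (weight times gradient, summed per unit) in place of the paper's gradient-plus-Hessian-structure scoring, and it omits the evolutionary approximation entirely. The names `taylor_scores` and `keep_mask`, the toy matrices `W` and `G`, and the chosen keep ratios are all hypothetical.

```python
import numpy as np

def taylor_scores(weights, grads):
    """Per-unit importance: |sum_j w_ij * g_ij| for each row (unit) i.

    Assumption: a first-order Taylor criterion stands in for the paper's
    scoring; `grads` would come from a label-free (self-supervised) loss.
    """
    return np.abs((weights * grads).sum(axis=1))

def keep_mask(scores, keep_ratio):
    """Boolean mask that keeps the top `keep_ratio` fraction of units."""
    n_keep = max(1, int(round(keep_ratio * scores.size)))
    # Stable sort on negated scores: highest-scoring units come first.
    order = np.argsort(-scores, kind="stable")
    mask = np.zeros(scores.size, dtype=bool)
    mask[order[:n_keep]] = True
    return mask

# Toy stand-ins for one layer's weights and their gradients (8 units, 4 inputs).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
G = rng.normal(size=(8, 4))

# Score once, then derive a mask for any budget from the same ranking.
scores = taylor_scores(W, G)
masks = {ratio: keep_mask(scores, ratio) for ratio in (0.25, 0.5, 0.75)}
```

Because every mask is a prefix of the same ranking, the units kept at a small budget are always a subset of those kept at a larger one, which is what allows one scoring pass to yield a family of nested sub-models.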
Jun-15-2026, 15:46:58 GMT
- Country:
- Europe (0.28)
- Genre:
- Research Report
- Experimental Study (1.00)
- Promising Solution (0.88)
- New Finding (0.68)
- Industry:
- Information Technology (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Representation & Reasoning (1.00)
- Natural Language (1.00)
- Machine Learning
- Statistical Learning (1.00)
- Neural Networks > Deep Learning (0.68)
- Evolutionary Systems (0.67)