DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Oct-11-2024, 05:03:43 GMT–Neural Information Processing Systems

Attention is sparse in vision transformers. We observe the final prediction in vision transformers is only based on a subset of most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically.

dynamic token sparsification, efficient vision transformer, vision transformer, (3 more...)

Neural Information Processing Systems

Oct-11-2024, 05:03:43 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Vision (0.97)