DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
Neural Information Processing Systems
Attention is sparse in vision transformers. We observe that the final prediction in vision transformers is based on only a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework that prunes redundant tokens progressively and dynamically based on the input. Specifically, we devise a lightweight prediction module that estimates the importance score of each token given the current features. The module is added at multiple layers to prune redundant tokens hierarchically.
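The idea of scoring tokens and keeping only the most informative ones can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scorer here is a plain linear map and pruning is a hard top-k selection, whereas DynamicViT uses a small MLP with differentiable (Gumbel-softmax) sampling during training. The function and variable names are hypothetical.

```python
import numpy as np

def prune_tokens(tokens, scorer_weights, keep_ratio=0.7):
    """Score each token with a lightweight predictor (here: a single
    linear projection, a stand-in for the paper's small MLP) and keep
    only the top-scoring fraction of tokens."""
    scores = tokens @ scorer_weights                  # (N,) importance per token
    n_keep = max(1, int(round(len(tokens) * keep_ratio)))
    keep_idx = np.argsort(scores)[::-1][:n_keep]      # indices of top-k tokens
    return tokens[np.sort(keep_idx)]                  # preserve original order

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))   # e.g. 14x14 patch tokens, dim 64
scorer_weights = rng.standard_normal(64)
pruned = prune_tokens(tokens, scorer_weights, keep_ratio=0.7)
print(pruned.shape)  # (137, 64): 70% of 196 tokens survive
```

Applying such a module at several layers, as the abstract describes, compounds the savings: with ratio 0.7 at three stages, only about a third of the original tokens reach the final blocks.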