Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

Neural Information Processing Systems 

Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens leads to higher prediction accuracy, but it also drastically increases computational cost. To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16 or 14x14. In this paper, we argue that every image has its own characteristics, and ideally the token number should be conditioned on each individual input.
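The tokenization the abstract describes can be sketched in a few lines. The snippet below (an illustrative sketch, not the paper's code; the function name `image_to_tokens` is my own) splits a square image into a `grid x grid` patch layout and flattens each patch into a token vector, showing how the token count, and hence the quadratic self-attention cost, changes with the grid size:

```python
import numpy as np

def image_to_tokens(image, grid):
    """Split a square HxWxC image into grid x grid patches and flatten
    each patch into a token vector (hypothetical helper for illustration)."""
    h, w, c = image.shape
    p = h // grid  # patch side length
    tokens = (image[:grid * p, :grid * p]
              .reshape(grid, p, grid, p, c)   # carve into a patch grid
              .transpose(0, 2, 1, 3, 4)       # group the two grid axes
              .reshape(grid * grid, p * p * c))  # one row per token
    return tokens

img = np.random.rand(224, 224, 3)
for grid in (7, 14, 16):
    tokens = image_to_tokens(img, grid)
    n = tokens.shape[0]
    # Self-attention over n tokens costs O(n^2) pairwise interactions.
    print(f"grid {grid}x{grid}: {n} tokens, ~{n * n} attention pairs")
```

A 14x14 grid on a 224x224 RGB image yields 196 tokens of dimension 768 (the standard ViT-Base setting), while a 7x7 grid needs only 49 tokens at 16x the attention cost reduction, which is the accuracy/speed trade-off the paper proposes to resolve per input.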
