Learning to Merge Tokens via Decoupled Embedding for Efficient Vision Transformers

Open in new window