Learning to Merge Tokens via Decoupled Embedding for Efficient Vision Transformers