DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion Yilong Chen

Open in new window