DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion

Yilong Chen

Neural Information Processing Systems

Large language models (LLMs) with billions of parameters demonstrate impressive performance.