Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA
