How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression