Critical attention scaling in long-context transformers