Critical attention scaling in long-context transformers