Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts
