Parallelizing Linear Transformers with the Delta Rule over Sequence Length
