Mini-Sequence Transformers: Optimizing Intermediate Memory for Long Sequences Training
Neural Information Processing Systems
We introduce Mini-Sequence Transformer (MsT), a simple and effective methodology for highly efficient and accurate LLM training with extremely long sequences. MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage. Integrated with activation recomputation, it enables significant memory savings in both forward and backward passes.
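The partitioning idea can be illustrated with a minimal sketch. It assumes a plain two-layer ReLU MLP block and uses NumPy rather than a training framework. The function names, shapes, and the choice of block are hypothetical and are not taken from the paper. The sketch shows only the forward pass: slicing the sequence into mini-sequences gives the same output while keeping a smaller intermediate alive at any one time. Activation recomputation and the backward pass are not shown.

```python
import numpy as np


def mlp_full(x, w_up, w_down):
    """Reference path: materializes the whole (seq_len, d_ff) intermediate."""
    hidden = np.maximum(x @ w_up, 0.0)
    return hidden @ w_down


def mlp_mini_sequence(x, w_up, w_down, num_chunks):
    """Process the sequence one mini-sequence at a time.

    Only a (roughly seq_len / num_chunks, d_ff) intermediate exists
    at any moment, instead of the full (seq_len, d_ff) one.
    """
    out = np.empty((x.shape[0], w_down.shape[1]), dtype=x.dtype)
    start = 0
    for chunk in np.array_split(x, num_chunks, axis=0):
        end = start + chunk.shape[0]
        hidden = np.maximum(chunk @ w_up, 0.0)  # small intermediate
        out[start:end] = hidden @ w_down
        start = end
    return out


# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 64, 8, 32
x = rng.standard_normal((seq_len, d_model))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))

full = mlp_full(x, w_up, w_down)
mini = mlp_mini_sequence(x, w_up, w_down, num_chunks=4)
```

Because each row of the output depends only on the matching row of the input, the chunked result agrees with the full one up to floating-point rounding. The trade-off in this sketch is more, smaller matrix multiplications in exchange for a lower peak intermediate size.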
Mar-22-2026, 02:36:53 GMT