Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training
Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, Anima Anandkumar
arXiv.org Artificial Intelligence
We introduce Mini-Sequence Transformer (MsT), a simple and effective methodology for highly efficient and accurate LLM training with extremely long sequences. MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage. Integrated with activation recomputation, it enables significant memory savings in both forward and backward passes. In experiments with the Llama3-8B model, thanks to our careful memory optimizations, MsT shows no degradation in throughput or convergence even with sequences 12x longer than standard implementations support. MsT is fully general, implementation-agnostic, and requires minimal code changes to integrate with existing LLM training frameworks.
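The core idea of partitioning the sequence before the memory-heavy per-token blocks can be illustrated with a short sketch. The snippet below is not the paper's implementation; it is a minimal PyTorch-style illustration, assuming a hypothetical helper `mini_sequence_forward` and a per-token `block` (e.g., an MLP or LM head), showing how chunking along the sequence dimension keeps only one mini-sequence's intermediate activations alive at a time.

```python
import torch

def mini_sequence_forward(block, hidden_states, num_chunks=4):
    """Illustrative sketch (not the authors' code): apply a memory-heavy,
    per-token block to the sequence in mini-sequences so that its large
    intermediate activations exist for only one chunk at a time."""
    # Split along the sequence dimension; shape is (batch, seq, hidden).
    chunks = hidden_states.chunk(num_chunks, dim=1)
    # Process each mini-sequence independently; per-token ops such as an
    # MLP or LM head give identical results whether applied whole or chunked.
    outputs = [block(c) for c in chunks]
    # Reassemble the full-sequence output.
    return torch.cat(outputs, dim=1)

# Example usage with a toy MLP block (hypothetical sizes):
if __name__ == "__main__":
    mlp = torch.nn.Sequential(
        torch.nn.Linear(64, 256), torch.nn.GELU(), torch.nn.Linear(256, 64)
    )
    x = torch.randn(2, 1024, 64)  # (batch, seq, hidden)
    y = mini_sequence_forward(mlp, x, num_chunks=8)
    assert torch.allclose(y, mlp(x), atol=1e-6)
```

Because the block is applied token-wise, chunking changes peak memory but not the result, which is why the approach can be combined with activation recomputation without affecting convergence.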
Jul-21-2024