Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models
