A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
