Not enough data to create a plot.
Try a different view from the menu above.
Ashfaq, Moetasim
ORBIT: Oak Ridge Base Foundation Model for Earth System Predictability
Wang, Xiao, Tsaris, Aristeidis, Liu, Siyan, Choi, Jong-Youl, Fan, Ming, Zhang, Wei, Yin, Junqi, Ashfaq, Moetasim, Lu, Dan, Balaprakash, Prasanna
Earth system predictability is challenged by the complexity of environmental dynamics and the multitude of variables involved. Current AI foundation models, although advanced by leveraging large and heterogeneous data, are often constrained by their size and data integration, limiting their effectiveness in addressing the full range of Earth system prediction challenges. To overcome these limitations, we introduce the Oak Ridge Base Foundation Model for Earth System Predictability (ORBIT), an advanced vision-transformer model that scales up to 113 billion parameters using a novel hybrid tensor-data orthogonal parallelism technique. As the largest model of its kind, ORBIT surpasses the current climate AI foundation model size by a thousandfold. Performance scaling tests conducted on the Frontier supercomputer have demonstrated that ORBIT achieves 230 to 707 PFLOPS, with scaling efficiency maintained at 78% to 96% across 24,576 AMD GPUs. These breakthroughs establish new advances in AI-driven climate modeling and demonstrate promise to significantly improve the Earth system predictability.
Sequence Length Scaling in Vision Transformers for Scientific Images on Frontier
Tsaris, Aristeidis, Zhang, Chengming, Wang, Xiao, Yin, Junqi, Liu, Siyan, Ashfaq, Moetasim, Fan, Ming, Choi, Jong Youl, Wahib, Mohamed, Lu, Dan, Balaprakash, Prasanna, Wang, Feiyi
Vision Transformers (ViTs) are pivotal for foundational models in scientific imagery, including Earth science applications, due to their capability to process large sequence lengths. While transformers for text has inspired scaling sequence lengths in ViTs, yet adapting these for ViTs introduces unique challenges. We develop distributed sequence parallelism for ViTs, enabling them to handle up to 1M tokens. Our approach, leveraging DeepSpeed-Ulysses and Long-Sequence-Segmentation with model sharding, is the first to apply sequence parallelism in ViT training, achieving a 94% batch scaling efficiency on 2,048 AMD-MI250X GPUs. Evaluating sequence parallelism in ViTs, particularly in models up to 10B parameters, highlighted substantial bottlenecks. We countered these with hybrid sequence, pipeline, tensor parallelism, and flash attention strategies, to scale beyond single GPU memory limits. Our method significantly enhances climate modeling accuracy by 20% in temperature predictions, marking the first training of a transformer model on a full-attention matrix over 188K sequence length.