Tsaris, Aristeidis
OReole-FM: successes and challenges toward billion-parameter foundation models for high-resolution satellite imagery
Dias, Philipe, Tsaris, Aristeidis, Bowman, Jordan, Potnis, Abhishek, Arndt, Jacob, Yang, H. Lexie, Lunga, Dalton
While the pretraining of Foundation Models (FMs) for remote sensing (RS) imagery is on the rise, models remain restricted to a few hundred million parameters. Scaling models to billions of parameters has been shown to yield unprecedented benefits, including emergent abilities, but requires data scaling and computing resources typically not available outside industry R&D labs. In this work, we pair high-performance computing resources, including the Frontier supercomputer, America's first exascale system, with high-resolution optical RS data to pretrain billion-scale FMs. Our study assesses the performance of different pretrained variants of vision Transformers across image classification, semantic segmentation, and object detection benchmarks, highlighting the importance of data scaling for effective model scaling. Moreover, we discuss the construction of a novel TIU pretraining dataset and model initialization, with data and pretrained models intended for public release. By discussing technical challenges and details often lacking in the related literature, this work offers best practices to the geospatial community toward efficient training and benchmarking of larger FMs.
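The abstract does not name the self-supervised pretraining objective, so as a hedged illustration only, the sketch below shows the random patch-masking step common to masked-autoencoder-style ViT pretraining in PyTorch; the 75% mask ratio and all names are assumptions, not details from the paper.

    # Minimal sketch of masked-patch input preparation for ViT pretraining.
    # The paper does not specify its objective; names and the 75% mask
    # ratio are illustrative assumptions.
    import torch

    def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
        """Keep a random subset of patch tokens.

        tokens: (batch, num_patches, embed_dim) patch embeddings.
        Returns the kept tokens and their indices.
        """
        b, n, d = tokens.shape
        n_keep = int(n * (1 - mask_ratio))
        noise = torch.rand(b, n, device=tokens.device)  # random score per patch
        ids_shuffle = noise.argsort(dim=1)              # random permutation
        ids_keep = ids_shuffle[:, :n_keep]              # patches the encoder sees
        kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
        return kept, ids_keep

    # Example: 196 patches (14x14) of a 224x224 image at patch size 16
    x = torch.randn(2, 196, 1024)
    kept, ids = random_masking(x)
    print(kept.shape)  # torch.Size([2, 49, 1024]) at the default 75% ratio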
Scalable Artificial Intelligence for Science: Perspectives, Methods and Exemplars
Brewer, Wesley, Kashi, Aditya, Dash, Sajal, Tsaris, Aristeidis, Yin, Junqi, Shankar, Mallikarjun, Wang, Feiyi
In a post-ChatGPT world, this paper explores the potential of leveraging scalable artificial intelligence for scientific discovery. We propose that scaling up artificial intelligence on high-performance computing platforms is essential to address complex scientific problems. This perspective focuses on scientific use cases such as cognitive simulations, large language models for scientific inquiry, medical image analysis, and physics-informed approaches. The study outlines the methodologies needed to address such challenges at scale on supercomputers or the cloud and provides exemplars of such approaches applied to a variety of scientific problems. In light of ChatGPT's growing popularity, the transformative potential of AI in science becomes increasingly evident. Although a number of recent articles highlight the transformative power of AI in science [1, 2, 3], few provide specifics on how to implement such methods at scale on supercomputers. Using ChatGPT as an archetype, we argue that the success of such complex AI models results from two primary advancements: (1) the development of the transformer architecture, and (2) the ability to train on vast amounts of internet-scale data. This reflects a broader trend within the field of AI, where combining massive amounts of training data with large-scale computational resources becomes the foundation of scientific breakthroughs. Several examples underscore the integral role of large-scale computational resources and colossal amounts of data in achieving scientific breakthroughs. For instance, Khan et al. [4] used AI and large-scale computing for advanced models of black hole mergers, leveraging a dataset of 14 million waveforms on the Summit supercomputer. Riley et al. [5] made significant progress toward understanding the physics of stratified fluid turbulence by modeling a Prandtl number of seven, which represents ocean water at 20 °C. Such simulations used four trillion grid points and required petabytes of storage [6].
ORBIT: Oak Ridge Base Foundation Model for Earth System Predictability
Wang, Xiao, Tsaris, Aristeidis, Liu, Siyan, Choi, Jong-Youl, Fan, Ming, Zhang, Wei, Yin, Junqi, Ashfaq, Moetasim, Lu, Dan, Balaprakash, Prasanna
Earth system predictability is challenged by the complexity of environmental dynamics and the multitude of variables involved. Current AI foundation models, although advanced by leveraging large and heterogeneous data, are often constrained by their size and data integration, limiting their effectiveness in addressing the full range of Earth system prediction challenges. To overcome these limitations, we introduce the Oak Ridge Base Foundation Model for Earth System Predictability (ORBIT), an advanced vision-transformer model that scales up to 113 billion parameters using a novel hybrid tensor-data orthogonal parallelism technique. As the largest model of its kind, ORBIT surpasses current climate AI foundation models in size by a thousandfold. Performance scaling tests conducted on the Frontier supercomputer demonstrate that ORBIT achieves 230 to 707 PFLOPS, with scaling efficiency maintained at 78% to 96% across 24,576 AMD GPUs. These breakthroughs establish new advances in AI-driven climate modeling and show promise for significantly improving Earth system predictability.
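The hybrid tensor-data parallelism the abstract names arranges GPUs on a two-dimensional grid, with one axis sharding tensors within a model replica and the other replicating shards for data parallelism. The sketch below shows only that rank-to-group layout; it is a hypothetical illustration of the general pattern, not ORBIT's implementation.

    # Illustrative rank layout for hybrid tensor-data parallelism: ranks on
    # a 2D grid, one axis shards tensors, the other replicates for data
    # parallelism. Names are hypothetical.
    def hybrid_groups(world_size: int, tp_size: int):
        """Return (tensor-parallel groups, data-parallel groups) as rank lists."""
        assert world_size % tp_size == 0
        dp_size = world_size // tp_size
        tp_groups = [list(range(i * tp_size, (i + 1) * tp_size)) for i in range(dp_size)]
        dp_groups = [list(range(j, world_size, tp_size)) for j in range(tp_size)]
        return tp_groups, dp_groups

    tp, dp = hybrid_groups(world_size=8, tp_size=2)
    print(tp)  # [[0, 1], [2, 3], [4, 5], [6, 7]]  ranks sharding one model copy
    print(dp)  # [[0, 2, 4, 6], [1, 3, 5, 7]]      ranks holding the same shard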
Pretraining Billion-scale Geospatial Foundational Models on Frontier
Tsaris, Aristeidis, Dias, Philipe Ambrozio, Potnis, Abhishek, Yin, Junqi, Wang, Feiyi, Lunga, Dalton
As AI workloads increase in scope, generalization becomes challenging for small task-specific models, and their demand for large amounts of labeled training samples grows. In contrast, Foundation Models (FMs) are trained on internet-scale unlabeled data via self-supervised learning and have been shown to adapt to various tasks with minimal fine-tuning. Although large FMs have demonstrated significant impact in natural language processing and computer vision, efforts toward FMs for geospatial applications have been restricted to smaller models, as pretraining larger models requires very large computing resources equipped with state-of-the-art hardware accelerators. Current satellite constellations collect 100+ TB of data per day, producing images that are billions of pixels in size and multimodal in nature. Such geospatial data poses unique challenges and opens new opportunities to develop FMs. We investigate billion-scale FMs and HPC training profiles for geospatial applications by pretraining on publicly available data, studying end-to-end the performance impact of scaling model size. Our larger 3B-parameter model achieves up to a 30% improvement in top-1 scene classification accuracy compared to a 100M-parameter model. Moreover, we detail performance experiments on the Frontier supercomputer, America's first exascale system, where we study different model and data parallel approaches using PyTorch's Fully Sharded Data Parallel (FSDP) library. Specifically, we study variants of the Vision Transformer (ViT) architecture, conducting performance analysis for ViT models with up to 15B parameters. By discussing throughput and performance bottlenecks under different parallelism configurations, we offer insights on how to leverage such leadership-class HPC resources when developing large models for geospatial imagery applications.
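The abstract explicitly names PyTorch's Fully Sharded Data Parallel library. A minimal sketch of sharding a ViT with FSDP at transformer-block granularity follows; the use of timm for the model definition and the local-rank arithmetic are assumptions for illustration, not details from the paper.

    # Minimal FSDP sketch: shard a ViT at transformer-block granularity so
    # parameters, gradients, and optimizer state are split across ranks
    # instead of replicated. timm model choice is an assumption.
    import functools
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
    import timm
    from timm.models.vision_transformer import Block

    dist.init_process_group("nccl")  # RCCL on AMD GPUs uses the same backend name
    model = timm.create_model("vit_large_patch16_224", pretrained=False)

    wrap_policy = functools.partial(transformer_auto_wrap_policy,
                                    transformer_layer_cls={Block})
    # 8 GCDs per Frontier node, so local device = global rank mod 8 (assumption)
    model = FSDP(model, auto_wrap_policy=wrap_policy,
                 device_id=dist.get_rank() % 8)

A real run would add a distributed launcher, data loading, and checkpointing; this only shows where the sharding decision is made.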
Sequence Length Scaling in Vision Transformers for Scientific Images on Frontier
Tsaris, Aristeidis, Zhang, Chengming, Wang, Xiao, Yin, Junqi, Liu, Siyan, Ashfaq, Moetasim, Fan, Ming, Choi, Jong Youl, Wahib, Mohamed, Lu, Dan, Balaprakash, Prasanna, Wang, Feiyi
Vision Transformers (ViTs) are pivotal for foundation models in scientific imagery, including Earth science applications, due to their capability to process large sequence lengths. While long-sequence transformers for text have inspired scaling sequence lengths in ViTs, adapting these techniques to ViTs introduces unique challenges. We develop distributed sequence parallelism for ViTs, enabling them to handle up to 1M tokens. Our approach, leveraging DeepSpeed-Ulysses and Long-Sequence-Segmentation with model sharding, is the first to apply sequence parallelism in ViT training, achieving a 94% batch scaling efficiency on 2,048 AMD MI250X GPUs. Evaluating sequence parallelism in ViTs, particularly in models up to 10B parameters, highlighted substantial bottlenecks. We countered these with hybrid sequence, pipeline, and tensor parallelism, together with flash attention, to scale beyond single-GPU memory limits. Our method enhances climate modeling accuracy by 20% in temperature prediction, marking the first training of a transformer model with a full attention matrix over a 188K sequence length.
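DeepSpeed-Ulysses, which the abstract builds on, exchanges a sequence-sharded layout for a head-sharded one via an all-to-all, so standard attention can run locally on each rank. The sketch below emulates that exchange on a single process to show the shape logic only; it is an illustration of the pattern, not the paper's implementation, and all names are hypothetical.

    # Emulate the Ulysses all-to-all on one process: each rank holds a
    # slice of the sequence for all heads; after the exchange it holds the
    # full sequence for a slice of heads.
    import torch

    def ulysses_exchange(q_local, sp_size):
        """q_local: (seq_len/sp, heads, dim) on each of sp_size ranks.

        After all-to-all each rank holds (seq_len, heads/sp, dim). Here the
        peers' shards are stand-ins, so we show rank 0's view.
        """
        shards = [q_local.clone() for _ in range(sp_size)]  # stand-in for peers
        # all-to-all sends head-group i of every rank to rank i
        parts = [torch.chunk(s, sp_size, dim=1) for s in shards]
        gathered = torch.cat([parts[r][0] for r in range(sp_size)], dim=0)
        return gathered  # full sequence, local head group

    q = torch.randn(256, 16, 64)   # 1/4 of a 1024-token sequence, 16 heads
    out = ulysses_exchange(q, sp_size=4)
    print(out.shape)               # torch.Size([1024, 4, 64])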
Track Seeding and Labelling with Embedded-space Graph Neural Networks
Choma, Nicholas, Murnane, Daniel, Ju, Xiangyang, Calafiura, Paolo, Conlon, Sean, Farrell, Steven, Prabhat, null, Cerati, Giuseppe, Gray, Lindsey, Klijnsma, Thomas, Kowalkowski, Jim, Spentzouris, Panagiotis, Vlimant, Jean-Roch, Spiropulu, Maria, Aurisano, Adam, Hewes, V, Tsaris, Aristeidis, Terao, Kazuhiro, Usher, Tracy
To address the unprecedented scale of HL-LHC data, the Exa.TrkX project is investigating a variety of machine learning approaches to particle track reconstruction. The most promising of these solutions, graph neural networks (GNNs), process the event as a graph that connects track measurements (detector hits corresponding to nodes) with candidate line segments between the hits (corresponding to edges). Detector information can be associated with nodes and edges, enabling a GNN to propagate the embedded parameters around the graph and predict node-, edge-, and graph-level observables. Message-passing GNNs have previously shown success in predicting doublet likelihood, and we report here updates on the state-of-the-art architectures for this task. In addition, the Exa.TrkX project has investigated innovations in both graph construction and embedded representations, in an effort to achieve fully learned end-to-end track finding. Hence, we present a suite of extensions to the original model, with encouraging results for hitgraph classification. We also explore improved performance by constructing graphs from learned representations that contain non-linear metric structure, allowing for efficient clustering and neighborhood queries of data points. We demonstrate how this framework fits in with both traditional clustering pipelines and GNN approaches. The embedded graphs feed into high-accuracy doublet and triplet classifiers, or can be used as an end-to-end track classifier by clustering in the embedded space. A set of post-processing methods improves performance with knowledge of the detector physics. Finally, we present numerical results on the TrackML particle tracking challenge dataset, where our framework shows favorable results in both seeding and track finding.
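The core idea the abstract describes, building hitgraphs from a learned embedding, can be sketched as: embed hits so that hits from the same track land close together, then connect each hit to its k nearest neighbors in that space to propose candidate edges. The PyTorch illustration below is a hypothetical stand-in; the MLP, features, and k are not the Exa.TrkX architecture.

    # Hedged sketch: learned embedding + kNN graph construction for hits.
    import torch

    embed = torch.nn.Sequential(           # maps hit features -> metric space
        torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))

    hits = torch.randn(1000, 3)            # e.g. (r, phi, z) per detector hit
    with torch.no_grad():                  # graph building is inference-only here
        z = embed(hits)                    # learned embedding, shape (1000, 8)
        d = torch.cdist(z, z)              # pairwise distances in embedding space
        d.fill_diagonal_(float("inf"))     # exclude self-edges
        k = 8
        nbrs = d.topk(k, largest=False).indices   # k nearest neighbors per hit

    src = torch.arange(1000).repeat_interleave(k)
    edge_index = torch.stack([src, nbrs.flatten()])  # (2, 8000) candidate edges
    print(edge_index.shape)

In the pipeline the abstract outlines, such candidate edges would then be scored by doublet and triplet classifiers or clustered directly in the embedded space.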