Brace, Alexander
Connecting Large Language Model Agent to High Performance Computing Resource
Ma, Heng, Brace, Alexander, Siebenschuh, Carlo, Pauloski, Greg, Foster, Ian, Ramanathan, Arvind
The Large Language Model agent workflow enables the LLM to invoke tool functions to increase the performance on specific scientific domain questions. To tackle large scale of scientific research, it requires access to computing resource and parallel computing setup. In this work, we implemented Parsl to the LangChain/LangGraph tool call setup, to bridge the gap between the LLM agent to the computing resource. Two tool call implementations were set up and tested on both local workstation and HPC environment on Polaris/ALCF. The first implementation with Parsl-enabled LangChain tool node queues the tool functions concurrently to the Parsl workers for parallel execution. The second configuration is implemented by converting the tool functions into Parsl ensemble functions, and is more suitable for large task on super computer environment. The LLM agent workflow was prompted to run molecular dynamics simulations, with different protein structure and simulation conditions. These results showed the LLM agent tools were managed and executed concurrently by Parsl on the available computing resource.
BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery
John, Peter St., Lin, Dejun, Binder, Polina, Greaves, Malcolm, Shah, Vega, John, John St., Lange, Adrian, Hsu, Patrick, Illango, Rajesh, Ramanathan, Arvind, Anandkumar, Anima, Brookes, David H, Busia, Akosua, Mahajan, Abhishaike, Malina, Stephen, Prasad, Neha, Sinai, Sam, Edwards, Lindsay, Gaudelet, Thomas, Regep, Cristian, Steinegger, Martin, Rost, Burkhard, Brace, Alexander, Hippe, Kyle, Naef, Luca, Kamata, Keisuke, Armstrong, George, Boyd, Kevin, Cao, Zhonglin, Chou, Han-Yi, Chu, Simon, Costa, Allan dos Santos, Darabi, Sajad, Dawson, Eric, Didi, Kieran, Fu, Cong, Geiger, Mario, Gill, Michelle, Hsu, Darren, Kaushik, Gagan, Korshunova, Maria, Kothen-Hill, Steven, Lee, Youhan, Liu, Meng, Livne, Micha, McClure, Zachary, Mitchell, Jonathan, Moradzadeh, Alireza, Mosafi, Ohad, Nashed, Youssef, Paliwal, Saee, Peng, Yuxing, Rabhi, Sara, Ramezanghorbani, Farhad, Reidenbach, Danny, Ricketts, Camir, Roland, Brian, Shah, Kushal, Shimko, Tyler, Sirelkhatim, Hassan, Srinivasan, Savitha, Stern, Abraham C, Toczydlowska, Dorota, Veccham, Srimukh Prasad, Venanzi, Niccolรฒ Alberto Elia, Vorontsov, Anton, Wilber, Jared, Wilkinson, Isabel, Wong, Wei Jing, Xue, Eva, Ye, Cory, Yu, Xin, Zhang, Yang, Zhou, Guoqing, Zandstein, Becca, Dallago, Christian, Trentini, Bruno, Kucukbenli, Emine, Paliwal, Saee, Rvachov, Timur, Calleja, Eddie, Israeli, Johnny, Clifford, Harry, Haukioja, Risto, Haemel, Nicholas, Tretina, Kyle, Tadimeti, Neha, Costa, Anthony B
Artificial Intelligence models encoding biology and chemistry are opening new routes to high-throughput and high-quality in-silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLM) training on hundreds of graphical processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows and is open to community contributions. We detail technical features of the BioNeMo Framework through use cases such as pLM pre-training and fine-tuning. On 256 NVIDIA A100s, BioNeMo Framework trains a three billion parameter BERT-based pLM on over one trillion tokens in 4.2 days. The BioNeMo Framework is open-source and free for everyone to use.
LSHBloom: Memory-efficient, Extreme-scale Document Deduplication
Khan, Arham, Underwood, Robert, Siebenschuh, Carlo, Babuji, Yadu, Ajith, Aswathy, Hippe, Kyle, Gokdemir, Ozan, Brace, Alexander, Chard, Kyle, Foster, Ian
Deduplication is a major focus for assembling and curating training datasets for large language models (LLM) -- detecting and eliminating additional instances of the same content -- in large collections of technical documents. Unrestrained, duplicates in the training dataset increase training costs and lead to undesirable properties such as memorization in trained models or cheating on evaluation. Contemporary approaches to document-level deduplication are often extremely expensive in both runtime and memory. We propose LSHBloom, an extension to MinhashLSH, which replaces the expensive LSHIndex with lightweight Bloom filters. LSHBloom demonstrates the same deduplication performance as MinhashLSH with only a marginal increase in false positives (as low as 1e-5 in our experiments); demonstrates competitive runtime (270\% faster than MinhashLSH on peS2o); and, crucially, uses just 0.6\% of the disk space required by MinhashLSH to deduplicate peS2o. We demonstrate that this space advantage scales with increased dataset size -- at the extreme scale of several billion documents, LSHBloom promises a 250\% speedup and a 54$\times$ space advantage over traditional MinHashLSH scaling deduplication of text datasets to many billions of documents.
DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies
Song, Shuaiwen Leon, Kruft, Bonnie, Zhang, Minjia, Li, Conglong, Chen, Shiyang, Zhang, Chengming, Tanaka, Masahiro, Wu, Xiaoxia, Rasley, Jeff, Awan, Ammar Ahmad, Holmes, Connor, Cai, Martin, Ghanem, Adam, Zhou, Zhongzhu, He, Yuxiong, Luferenko, Pete, Kumar, Divya, Weyn, Jonathan, Zhang, Ruixiong, Klocek, Sylwester, Vragov, Volodymyr, AlQuraishi, Mohammed, Ahdritz, Gustaf, Floristean, Christina, Negri, Cristina, Kotamarthi, Rao, Vishwanath, Venkatram, Ramanathan, Arvind, Foreman, Sam, Hippe, Kyle, Arcomano, Troy, Maulik, Romit, Zvyagin, Maxim, Brace, Alexander, Zhang, Bin, Bohorquez, Cindy Orozco, Clyde, Austin, Kale, Bharat, Perez-Rivera, Danilo, Ma, Heng, Mann, Carla M., Irvin, Michael, Pauloski, J. Gregory, Ward, Logan, Hayot, Valerie, Emani, Murali, Xie, Zhen, Lin, Diangen, Shukla, Maulik, Foster, Ian, Davis, James J., Papka, Michael E., Brettin, Thomas, Balaprakash, Prasanna, Tourassi, Gina, Gounley, John, Hanson, Heidi, Potok, Thomas E, Pasini, Massimiliano Lupo, Evans, Kate, Lu, Dan, Lunga, Dalton, Yin, Junqi, Dash, Sajal, Wang, Feiyi, Shankar, Mallikarjun, Lyngaas, Isaac, Wang, Xiao, Cong, Guojing, Zhang, Pei, Fan, Ming, Liu, Siyan, Hoisie, Adolfy, Yoo, Shinjae, Ren, Yihui, Tang, William, Felker, Kyle, Svyatkovskiy, Alexey, Liu, Hang, Aji, Ashwin, Dalton, Angela, Schulte, Michael, Schulz, Karl, Deng, Yuntian, Nie, Weili, Romero, Josh, Dallago, Christian, Vahdat, Arash, Xiao, Chaowei, Gibbs, Thomas, Anandkumar, Anima, Stevens, Rick
In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. To answer this call, we present DeepSpeed4Science initiative (deepspeed4science.ai) which aims to build unique capabilities through AI system technology innovations to help domain experts to unlock today's biggest science mysteries. By leveraging DeepSpeed's current technology pillars (training, inference and compression) as base technology enablers, DeepSpeed4Science will create a new set of AI system technologies tailored for accelerating scientific discoveries by addressing their unique complexity beyond the common technical approaches used for accelerating generic large language models (LLMs). In this paper, we showcase the early progress we made with DeepSpeed4Science in addressing two of the critical system challenges in structural biology research.
Linking the Dynamic PicoProbe Analytical Electron-Optical Beam Line / Microscope to Supercomputers
Brace, Alexander, Vescovi, Rafael, Chard, Ryan, Saint, Nickolaus D., Ramanathan, Arvind, Zaluzec, Nestor J., Foster, Ian
The Dynamic PicoProbe at Argonne National Laboratory is undergoing upgrades that will enable it to produce up to 100s of GB of data per day. While this data is highly important for both fundamental science and industrial applications, there is currently limited on-site infrastructure to handle these high-volume data streams. We address this problem by providing a software architecture capable of supporting large-scale data transfers to the neighboring supercomputers at the Argonne Leadership Computing Facility. To prepare for future scientific workflows, we implement two instructive use cases for hyperspectral and spatiotemporal datasets, which include: (i) off-site data transfer, (ii) machine learning/artificial intelligence and traditional data analysis approaches, and (iii) automatic metadata extraction and cataloging of experimental results. This infrastructure supports expected workloads and also provides domain scientists the ability to reinterrogate data from past experiments to yield additional scientific value and derive new insights.