BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery
John, Peter St., Lin, Dejun, Binder, Polina, Greaves, Malcolm, Shah, Vega, John, John St., Lange, Adrian, Hsu, Patrick, Illango, Rajesh, Ramanathan, Arvind, Anandkumar, Anima, Brookes, David H, Busia, Akosua, Mahajan, Abhishaike, Malina, Stephen, Prasad, Neha, Sinai, Sam, Edwards, Lindsay, Gaudelet, Thomas, Regep, Cristian, Steinegger, Martin, Rost, Burkhard, Brace, Alexander, Hippe, Kyle, Naef, Luca, Kamata, Keisuke, Armstrong, George, Boyd, Kevin, Cao, Zhonglin, Chou, Han-Yi, Chu, Simon, Costa, Allan dos Santos, Darabi, Sajad, Dawson, Eric, Didi, Kieran, Fu, Cong, Geiger, Mario, Gill, Michelle, Hsu, Darren, Kaushik, Gagan, Korshunova, Maria, Kothen-Hill, Steven, Lee, Youhan, Liu, Meng, Livne, Micha, McClure, Zachary, Mitchell, Jonathan, Moradzadeh, Alireza, Mosafi, Ohad, Nashed, Youssef, Paliwal, Saee, Peng, Yuxing, Rabhi, Sara, Ramezanghorbani, Farhad, Reidenbach, Danny, Ricketts, Camir, Roland, Brian, Shah, Kushal, Shimko, Tyler, Sirelkhatim, Hassan, Srinivasan, Savitha, Stern, Abraham C, Toczydlowska, Dorota, Veccham, Srimukh Prasad, Venanzi, Niccolò Alberto Elia, Vorontsov, Anton, Wilber, Jared, Wilkinson, Isabel, Wong, Wei Jing, Xue, Eva, Ye, Cory, Yu, Xin, Zhang, Yang, Zhou, Guoqing, Zandstein, Becca, Dallago, Christian, Trentini, Bruno, Kucukbenli, Emine, Rvachov, Timur, Calleja, Eddie, Israeli, Johnny, Clifford, Harry, Haukioja, Risto, Haemel, Nicholas, Tretina, Kyle, Tadimeti, Neha, Costa, Anthony B
Artificial Intelligence models encoding biology and chemistry are opening new routes to high-throughput, high-quality in silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLMs) trained on hundreds of graphics processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows, and is open to community contributions. We detail technical features of the BioNeMo Framework through use cases such as pLM pre-training and fine-tuning. On 256 NVIDIA A100s, the BioNeMo Framework trains a three-billion-parameter BERT-based pLM on over one trillion tokens in 4.2 days. The BioNeMo Framework is open source and free for everyone to use.
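As a sanity check on the reported scale, the abstract's figures imply a sustained training throughput that can be computed directly. This is a back-of-the-envelope sketch only: the one-trillion-token figure is a lower bound, and the calculation ignores warm-up, checkpointing and restarts.

```python
# Throughput implied by the reported run: a 3B-parameter BERT-based pLM
# trained on over one trillion tokens in 4.2 days on 256 NVIDIA A100s
# (all figures taken from the abstract above).

tokens = 1.0e12          # "over one trillion tokens" (lower bound)
days = 4.2
gpus = 256

seconds = days * 24 * 3600
total_tps = tokens / seconds       # aggregate tokens per second
per_gpu_tps = total_tps / gpus     # tokens per second per A100

print(f"aggregate throughput: {total_tps:,.0f} tokens/s")
print(f"per-GPU throughput:   {per_gpu_tps:,.0f} tokens/s")
```

The run therefore sustains on the order of 2.8 million tokens per second in aggregate, i.e. roughly eleven thousand tokens per second per GPU.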
Parsimonious Feature Extraction Methods: Extending Robust Probabilistic Projections with Generalized Skew-t
Toczydlowska, Dorota, Peters, Gareth W., Shevchenko, Pavel V.
The study focuses on extensions of Principal Component Analysis (PCA), as defined in [1], [2] or [3]. PCA and related matrix factorisation methodologies are widely used in data-rich environments for dimensionality reduction, data compression, feature extraction and data de-noising. These methodologies identify a lower-dimensional linear subspace to represent the data, capturing the dominant second-order information contained in high-dimensional data sets. PCA can be viewed as a matrix factorisation problem that aims to learn a lower-dimensional representation of the data while preserving its Euclidean structure. However, when the data-generating distribution is non-Gaussian, or when outliers corrupt the data, the standard PCA methodology provides biased information about the lower-rank representation. In many applications, the stochastic noise or observation errors in the data set are assumed to be, in some sense, "well-behaved"; for instance, additive, light-tailed, symmetric and zero-mean. When non-robust feature extraction methods are naively applied in the presence of violations of these implicit statistical assumptions, the information contained in the extracted features cannot be relied upon, resulting in misleading inference. It is therefore critical to ensure that the feature extraction captures the correct characteristics of the process generating the data. In the following study, we relax the inherent assumption of "well-behaved" observation noise by developing a class of robust estimators that can withstand violations of such assumptions, which routinely arise in real data sets.
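The sensitivity of standard PCA to outliers described above can be illustrated with a small, self-contained sketch (a hypothetical toy example, not the robust estimator developed in the paper): a handful of gross outliers lying off the dominant subspace is enough to rotate the leading principal direction away from the true one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean data: 500 points concentrated along the direction (1, 1),
# with light-tailed, zero-mean ("well-behaved") additive noise.
t = rng.normal(size=(500, 1))
clean = t @ np.array([[1.0, 1.0]]) + 0.05 * rng.normal(size=(500, 2))

# A handful of gross outliers far off the dominant subspace.
outliers = np.array([[25.0, -25.0]] * 5)
corrupted = np.vstack([clean, outliers])

def first_pc(x):
    """Leading principal direction via SVD of the centred data matrix."""
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    v = vt[0]
    return v * np.sign(v[0])   # fix the sign for comparability

true_dir = np.array([1.0, 1.0]) / np.sqrt(2.0)
pc_clean = first_pc(clean)
pc_corrupt = first_pc(corrupted)

# |cosine| between the estimated leading direction and the true one:
# close to 1 on clean data, driven towards 0 by five outliers.
print("clean data:    ", abs(pc_clean @ true_dir))
print("corrupted data:", abs(pc_corrupt @ true_dir))
```

On the clean sample the leading principal direction is essentially (1, 1); after adding just five corrupted observations, the squared magnitude of the outliers dominates the second-order statistics and the estimated direction swings towards (1, -1), which is the bias the robust estimators in the paper are designed to prevent.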