Young, Adamo
MassSpecGym: A benchmark for the discovery and identification of molecules
Bushuiev, Roman, Bushuiev, Anton, de Jonge, Niek F., Young, Adamo, Kretschmer, Fleming, Samusevich, Raman, Heirman, Janne, Wang, Fei, Zhang, Luke, Dührkop, Kai, Ludwig, Marcus, Haupt, Nils A., Kalia, Apurva, Brungs, Corinna, Schmid, Robin, Greiner, Russell, Wang, Bo, Wishart, David S., Liu, Li-Ping, Rousu, Juho, Bittremieux, Wout, Röst, Hannes, Mak, Tytus D., Hassoun, Soha, Huber, Florian, van der Hooft, Justin J. J., Stravs, Michael A., Böcker, Sebastian, Sivic, Josef, Pluskal, Tomáš
The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym, the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, therefore standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.
FraGNNet: A Deep Probabilistic Model for Mass Spectrum Prediction
Young, Adamo, Wang, Fei, Wishart, David, Wang, Bo, Röst, Hannes, Greiner, Russ
The process of identifying a compound from its mass spectrum is a critical step in the analysis of complex mixtures. Typical solutions for the mass spectrum to compound (MS2C) problem involve matching the unknown spectrum against a library of known spectrum-molecule pairs, an approach that is limited by incomplete library coverage. Compound to mass spectrum (C2MS) models can improve retrieval rates by augmenting real libraries with predicted spectra. Unfortunately, many existing C2MS models suffer from problems with prediction resolution, scalability, or interpretability. We develop a new probabilistic method for C2MS prediction, FraGNNet, that can efficiently and accurately predict high-resolution spectra. FraGNNet uses a structured latent space to provide insight into the underlying processes that define the spectrum. Our model achieves state-of-the-art performance in terms of prediction error, and surpasses existing C2MS models as a tool for retrieval-based MS2C.
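The retrieval idea described above (matching an unknown spectrum against a library that may hold real or predicted spectra) can be sketched as follows. This is a minimal illustration only: the fixed-width binning, the cosine score, and every function name here are assumptions chosen for the example, not FraGNNet's actual procedure or scoring function.

```python
import math


def bin_spectrum(peaks, bin_width=1.0, max_mz=1000.0):
    """Turn (m/z, intensity) pairs into a fixed-length intensity vector.

    bin_width and max_mz are illustrative defaults, not values from the paper.
    """
    n_bins = int(max_mz / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] += intensity
    return vec


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_peaks, library):
    """Rank library entries (name -> peak list) by similarity to the query spectrum."""
    query_vec = bin_spectrum(query_peaks)
    scored = [
        (name, cosine_similarity(query_vec, bin_spectrum(peaks)))
        for name, peaks in library.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In this sketch, a library entry can come from either a measured or a predicted spectrum; the ranking code does not distinguish between them, which is how predicted spectra can extend library coverage.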
Unleashing the Strengths of Unlabeled Data in Pan-cancer Abdominal Organ Quantification: the FLARE22 Challenge
Ma, Jun, Zhang, Yao, Gu, Song, Ge, Cheng, Ma, Shihao, Young, Adamo, Zhu, Cheng, Meng, Kangkang, Yang, Xin, Huang, Ziyan, Zhang, Fan, Liu, Wentao, Pan, YuanKe, Huang, Shoujin, Wang, Jiacheng, Sun, Mingze, Xu, Weixin, Jia, Dengqiang, Choi, Jae Won, Alves, Natália, de Wilde, Bram, Koehler, Gregor, Wu, Yajun, Wiesenfarth, Manuel, Zhu, Qiongjie, Dong, Guoqiang, He, Jian, Consortium, the FLARE Challenge, Wang, Bo
Quantitative organ assessment is an essential step in automated abdominal disease diagnosis and treatment planning. Artificial intelligence (AI) has shown great potential to automate this process. However, most existing AI algorithms rely on many expert annotations and lack a comprehensive evaluation of accuracy and efficiency in real-world multinational settings. To overcome these limitations, we organized the FLARE 2022 Challenge, the largest abdominal organ analysis challenge to date, to benchmark fast, low-resource, accurate, annotation-efficient, and generalized AI algorithms. We constructed an intercontinental and multinational dataset from more than 50 medical groups, including Computed Tomography (CT) scans covering different races, diseases, phases, and manufacturers. We independently validated that a set of AI algorithms achieved a median Dice Similarity Coefficient (DSC) of 90.0% using 50 labeled scans and 2000 unlabeled scans, which can significantly reduce annotation requirements. They also enabled automatic extraction of key organ biology features, which was labor-intensive with traditional manual measurements. This opens the potential to use unlabeled data to boost performance and alleviate annotation shortages for modern AI models. Abdominal organs are sites of high cancer incidence, including liver, kidney, pancreatic, and gastric cancer [1]. Computed Tomography (CT) scanning has been a major imaging technology for the diagnosis and treatment of abdominal cancer because it is fast and yields important prognostic information for cancer patients, and it is recommended by many clinical treatment guidelines. To quantify abdominal organs, radiologists and clinicians need to manually delineate organ boundaries in each slice of the 3D CT scans [2], [3]. However, manual segmentation is time-consuming and inherently subjective, with inter- and intra-expert variability.
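For readers unfamiliar with the metric, the Dice Similarity Coefficient for a binary segmentation is DSC = 2|A ∩ B| / (|A| + |B|), where A is the predicted mask and B the reference mask. The sketch below computes it on flattened 0/1 masks and takes the median over cases. The function names are hypothetical, and returning 1.0 when both masks are empty is an assumed convention; this is not the challenge's official evaluation code.

```python
from statistics import median


def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient between two flattened binary masks (0/1 values)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    total = sum(pred) + sum(truth)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


def median_dice(cases):
    """Median DSC over an iterable of (pred_mask, truth_mask) pairs."""
    return median(dice_coefficient(pred, truth) for pred, truth in cases)
```

A multi-organ evaluation would typically apply this per organ label and then aggregate, but how the challenge aggregates scores is not stated here.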
MassFormer: Tandem Mass Spectrum Prediction for Small Molecules using Graph Transformers
Young, Adamo, Wang, Bo, Röst, Hannes
Tandem mass spectra capture fragmentation patterns that provide key structural information about a molecule. Although mass spectrometry is applied in many areas, the vast majority of small molecules lack experimental reference spectra. For over seventy years, spectrum prediction has remained a key challenge in the field. Existing deep learning methods do not leverage global structure in the molecule, potentially resulting in difficulties when generalizing to new data. In this work we propose a new model, MassFormer, for accurately predicting tandem mass spectra. MassFormer uses a graph transformer architecture to model long-distance relationships between atoms in the molecule. The transformer module is initialized with parameters obtained through a chemical pre-training task, then fine-tuned on spectral data. MassFormer outperforms competing approaches for spectrum prediction on multiple datasets, and is able to recover prior knowledge about the effect of collision energy on the spectrum. By employing gradient-based attribution methods, we demonstrate that the model can identify relationships between fragment peaks. To further highlight MassFormer's utility, we show that it can match or exceed existing prediction-based methods on two spectrum identification tasks. We provide open-source implementations of our model and baseline approaches, with the goal of encouraging future research in this area.