AI-driven Generation of MALDI-TOF MS for Microbial Characterization

Schmidt-Santiago, Lucía, Rodríguez-Temporal, David, Sevilla-Salcedo, Carlos, Gómez-Verdejo, Vanessa

arXiv.org Artificial Intelligence 

Matrix-Assisted Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (MALDI-TOF MS) has become a cornerstone technology in clinical microbiology, enabling rapid and accurate microbial identification. However, the development of data-driven diagnostic models remains limited by the lack of sufficiently large, balanced, and standardized spectral datasets. This study investigates the use of deep generative models to synthesize realistic MALDI-TOF MS spectra, aiming to overcome data scarcity and support the development of robust machine learning tools in microbiology. We adapt and evaluate three generative models, Variational Autoencoders (MALDIVAEs), Generative Adversarial Networks (MALDIGANs), and Denoising Diffusion Probabilistic Model (MALDIffusion), for the conditional generation of microbial spectra guided by species labels. Generation is conditioned on species labels, and spectral fidelity and diversity are assessed using diverse metrics. Our experiments show that synthetic data generated by MALDIVAE, MALDIGAN, and MALDIffusion are statistically and diagnostically comparable to real measurements, enabling classifiers trained exclusively on synthetic samples to reach performance levels similar to those trained on real data. While all models faithfully reproduce the peak structure and variability of MALDI-TOF spectra, MALDIffusion obtains this fidelity at a substantially higher computational cost, and MALDIGAN shows competitive but slightly less stable behaviour. In contrast, MALDIVAE offers the most favorable balance between realism, stability, and efficiency. Furthermore, augmenting minority species with synthetic spectra markedly improves classification accuracy, effectively mitigating class imbalance and domain mismatch without compromising the authenticity of the generated data.