molecular fingerprint
One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra
Neo, Neng Kai Nigel, Jing, Lim, Preston, Ngoui Yong Zhau, Serene, Koh Xue Ting, Shen, Bingquan
A common approach to the de novo molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt MIST (Goldman et al., 2023) as the encoder and MolForge (Ucak et al., 2023) as the decoder, leveraging additional training data to enhance performance. We also threshold the probabilities of each fingerprint bit to focus on the presence of substructures. This results in a tenfold improvement over previous state-of-the-art methods, correctly generating 31% of molecular structures at top-1 and 40% at top-10 from mass spectra in MassSpecGym (Bushuiev et al., 2024). We position this as a strong baseline for future research in de novo molecule elucidation from mass spectra.
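The thresholding step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 0.5 cutoff, the six-bit toy vectors, and the helper names `threshold_fingerprint` and `tanimoto` are assumptions introduced here. Tanimoto similarity is included only as a common way to compare binary fingerprints.

```python
from typing import List, Sequence


def threshold_fingerprint(probs: Sequence[float], cutoff: float = 0.5) -> List[int]:
    """Turn per-bit probabilities into a binary fingerprint.

    The cutoff of 0.5 is an assumption; the abstract does not state the value used.
    """
    return [1 if p >= cutoff else 0 for p in probs]


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto similarity between two equal-length binary fingerprints."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 1.0


# Hypothetical encoder output: probability that each substructure bit is set.
predicted_probs = [0.92, 0.08, 0.51, 0.49, 0.30, 0.77]
predicted_bits = threshold_fingerprint(predicted_probs)

# Hypothetical reference fingerprint of the true molecule.
reference_bits = [1, 0, 1, 0, 1, 1]
similarity = tanimoto(predicted_bits, reference_bits)
```

After thresholding, the decoder sees only whether a substructure is predicted to be present, not how confident the encoder was.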
MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation
Han, Yang, Wang, Pengyu, Yu, Kai, Chen, Xin, Chen, Lu
Mass spectrometry (MS) plays a critical role in molecular identification, significantly advancing scientific discovery. However, structure elucidation from MS data remains challenging due to the scarcity of annotated spectra. While large-scale pretraining has proven effective in addressing data scarcity in other domains, applying this paradigm to mass spectrometry is hindered by the complexity and heterogeneity of raw spectral signals. To address this, we propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary, enabling cross-modal learning through large-scale pretraining on reliably computed fingerprint-molecule datasets. Multi-task pretraining objectives further enhance MS-BART's generalization by jointly optimizing denoising and translation tasks. The pretrained model is subsequently transferred to experimental spectra through finetuning on fingerprint predictions generated with MIST, a pre-trained spectral inference model, thereby enhancing robustness to real-world spectral variability. While finetuning alleviates the distributional difference, MS-BART still suffers from molecular hallucination and requires further alignment. We therefore introduce a chemical feedback mechanism that guides the model toward generating molecules closer to the reference structure. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1 and is faster by one order of magnitude than competing diffusion-based methods, while comprehensive ablation studies systematically validate the model's effectiveness and robustness.
SoDaDE: Solvent Data-Driven Embeddings with Small Transformer Models
Gibberd, Gabriel Kitso, Folch, Jose Pablo, Chanona, Antonio Del Rio
Computational representations have become crucial in unlocking the recent growth of machine learning algorithms for chemistry. Although representations were initially hand-designed, machine learning has shown that meaningful representations can be learnt from data. Chemical datasets are limited, and so the representations learnt from data are generic, being trained on broad datasets which contain shallow information on many different molecule types. For example, generic fingerprints lack physical context specific to solvents. However, the use of harmful solvents is a leading climate-related issue in the chemical industry, and there is a surge of interest in green solvent replacement. To empower this research, we propose a new solvent representation scheme by developing Solvent Data Driven Embeddings (SoDaDE). SoDaDE uses a small transformer model and a solvent property dataset to create a fingerprint for solvents. To showcase their effectiveness, we use SoDaDE to predict yields on a recently published dataset, outperforming previous representations. We demonstrate through this paper that data-driven fingerprints can be made with small datasets and set up a workflow that can be explored for other applications.
Towards Rational Pesticide Design with Graph Machine Learning Models for Ecotoxicology
This research focuses on rational pesticide design, using graph machine learning to accelerate the development of safer, eco-friendly agrochemicals, inspired by in silico methods in drug discovery. With an emphasis on ecotoxicology, the initial contributions include the creation of ApisTox, the largest curated dataset on pesticide toxicity to honey bees. We conducted a broad evaluation of machine learning (ML) models for molecular graph classification, including molecular fingerprints, graph kernels, GNNs, and pretrained transformers. The results show that methods successful in medicinal chemistry often fail to generalize to agrochemicals, underscoring the need for domain-specific models and benchmarks. Future work will focus on developing a comprehensive benchmarking suite and designing ML models tailored to the unique challenges of pesticide discovery.
Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning
Praski, Mateusz, Adamczyk, Jakub, Czech, Wojciech
Pretrained neural networks have attracted significant interest in chemistry and small molecule drug design. Embeddings from these models are widely used for molecular property prediction, virtual screening, and small data learning in molecular chemistry. This study presents the most extensive comparison of such models to date, evaluating 25 models across 25 datasets. Under a fair comparison framework, we assess models spanning various modalities, architectures, and pretraining strategies. Using a dedicated hierarchical Bayesian statistical testing model, we arrive at a surprising result: nearly all neural models show negligible or no improvement over the baseline ECFP molecular fingerprint. Only the CLAMP model, which is also based on molecular fingerprints, performs statistically significantly better than the alternatives. These findings raise concerns about the evaluation rigor in existing studies. We discuss potential causes, propose solutions, and offer practical recommendations.
High-Throughput Computational Screening and Interpretable Machine Learning of Metal-organic Frameworks for Iodine Capture
Tan, Haoyi, Teng, Yukun, Shan, Guangcun
The removal of leaked radioactive iodine isotopes in humid environments holds significant importance in nuclear waste management and nuclear accident mitigation. In this study, high-throughput computational screening and machine learning were combined to reveal the iodine capture performance of 1816 metal-organic framework (MOF) materials under humid air conditions. Firstly, the relationship between the structural characteristics of MOF materials (including density, surface area and pore features) and their adsorption properties was explored, with the aim of identifying the optimal structural parameters for iodine capture. Subsequently, two machine learning regression algorithms, Random Forest and CatBoost, were employed to predict the iodine adsorption capabilities of MOF materials. In addition to 6 structural features, 25 molecular features (encompassing the types of metal and ligand atoms as well as bonding modes) and 8 chemical features (including heat of adsorption and Henry's coefficient) were incorporated to enhance the prediction accuracy of the machine learning algorithms. Feature importance was assessed to determine the relative influence of various features on iodine adsorption performance, in which the Henry's coefficient and heat of adsorption to iodine were found to be the two most crucial chemical factors. Furthermore, four types of molecular fingerprints were introduced to provide comprehensive and detailed structural information on MOF materials. The top 20 most significant MACCS molecular fingerprints were picked out, revealing that the presence of six-membered ring structures and nitrogen atoms in the MOF framework were the key structural factors that enhanced iodine adsorption, followed by the existence of oxygen atoms.
This work combined high-throughput computation, machine learning, and molecular fingerprints to comprehensively and systematically elucidate the multifaceted factors influencing the iodine adsorption performance of MOFs in humid environments, offering profound and insightful guidelines for screening and structural design of advanced MOF materials.
Molecular Fingerprints Are Strong Models for Peptide Function Prediction
Adamczyk, Jakub, Ludynia, Piotr, Czech, Wojciech
We study the effectiveness of molecular fingerprints for peptide property prediction and demonstrate that domain-specific feature extraction from molecular graphs can outperform complex and computationally expensive models such as GNNs, pretrained sequence-based transformers and multimodal ensembles, even without hyperparameter tuning. To this end, we perform a thorough evaluation on 126 datasets, achieving state-of-the-art results on LRGB and 5 other peptide function prediction benchmarks. We show that models based on count variants of ECFP, Topological Torsion, and RDKit molecular fingerprints and LightGBM as classification head are remarkably robust. The strong performance of molecular fingerprints, which are intrinsically very short-range feature encoders, challenges the presumed importance of long-range interactions in peptides. Our conclusion is that the use of molecular fingerprints for larger molecules, such as peptides, can be a computationally feasible, low-parameter, and versatile alternative to sophisticated deep learning models.
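The difference between binary and count fingerprint variants mentioned above can be illustrated with a toy sketch. It is not ECFP, Topological Torsion, or the RDKit fingerprint: the substructure labels, the `fold` helper, the CRC32 hashing, and the 16-position vector length are all assumptions made for illustration.

```python
import zlib
from typing import List, Sequence


def fold(features: Sequence[str], n_bits: int = 16, count: bool = False) -> List[int]:
    """Hash substructure labels into a fixed-length vector.

    With count=False each position records presence (0 or 1);
    with count=True each position records how many features hashed there.
    """
    vec = [0] * n_bits
    for feature in features:
        # CRC32 is used only because it is deterministic across runs.
        idx = zlib.crc32(feature.encode("utf-8")) % n_bits
        if count:
            vec[idx] += 1
        else:
            vec[idx] = 1
    return vec


# Hypothetical substructure labels for a short peptide with repeated units.
features = ["amide", "amide", "amide", "C-alpha", "C-alpha", "phenyl"]
binary_fp = fold(features)
count_fp = fold(features, count=True)
```

For molecules built from repeated units, such as peptides, the count vector keeps how often a local environment occurs, which the binary vector discards.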
Human-level molecular optimization driven by mol-gene evolution
Fang, Jiebin, Mao, Churu, Zhu, Yuchen, Chen, Xiaoming, Hsieh, Chang-Yu, Ma, Zhongjun
De novo molecule generation allows the search for more drug-like hits across a vast chemical space. However, lead optimization is still required, and the process of optimizing molecular structures faces the challenge of balancing structural novelty with pharmacological properties. This study introduces the Deep Genetic Molecular Modification Algorithm (DGMM), which brings structure modification to the level of medicinal chemists. A discrete variational autoencoder (D-VAE) is used in DGMM to encode molecules as quantization code, mol-gene, which incorporates deep learning into genetic algorithms for flexible structural optimization. The mol-gene allows for the discovery of pharmacologically similar but structurally distinct compounds, and reveals the trade-offs of structural optimization in drug discovery. We demonstrate the effectiveness of the DGMM in several applications.
A Python library for efficient computation of molecular fingerprints
Szafarczyk, Michał, Ludynia, Piotr, Kukla, Przemysław
Machine learning solutions are very popular in the field of chemoinformatics, where they have numerous applications, such as novel drug discovery or molecular property prediction. Molecular fingerprints are algorithms commonly used for vectorizing chemical molecules as a part of preprocessing in this kind of solution. However, despite their popularity, there are no libraries that implement them efficiently for large datasets, utilizing modern, multicore architectures. On top of that, most of them do not provide the user with an intuitive interface, or one that would be compatible with other machine learning tools. In this project, we created a Python library that computes molecular fingerprints efficiently and delivers an interface that is comprehensive and enables the user to easily incorporate the library into their existing machine learning workflow. The library enables the user to perform computation on large datasets using parallelism. Because of that, it is possible to perform such tasks as hyperparameter tuning in a reasonable time. We describe the tools used in the implementation of the library and assess its time performance on example benchmark datasets. Additionally, we show that using molecular fingerprints we can achieve results comparable to state-of-the-art ML solutions even with very simple models.
When SMILES have Language: Drug Classification using Text Classification Methods on Drug SMILES Strings
Wasi, Azmine Toushik, Karlo, Šerbetar, Islam, Raima, Rafi, Taki Hasan, Chae, Dong-Kyu
Complex chemical structures, like drugs, are usually defined by SMILES strings as a sequence of atoms and bonds. These SMILES strings are used in a range of complex machine learning-based drug research and representation works. Escaping from complex representation, in this work, we pose a single question: What if we treat drug SMILES as conventional sentences and engage in text classification for drug classification? The study explores the notion of viewing each atom and bond as sentence components, employing basic NLP methods to categorize drug types, proving that complex problems can also be solved with simpler perspectives. Classifying drug types plays a pivotal role in drug discovery research, aiding in the categorization of established drugs and enhancing our understanding of the distinctive features of newly identified or synthesized drugs.
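The idea of treating a SMILES string as a sentence can be sketched as follows. This is a hedged illustration, not the method from the paper: the regular expression, the `tokenize` helper, and the bag-of-tokens features are assumptions, since the abstract does not specify which NLP methods were used. The pattern covers common SMILES syntax only and is not a full parser.

```python
import re
from collections import Counter
from typing import List

# Bracket atoms, two-letter halogens, two-digit ring closures, single-letter
# atoms, ring-closure digits, then bond and branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/()@.:~*$]")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into atom, bond, branch, and ring tokens."""
    return SMILES_TOKEN.findall(smiles)


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
tokens = tokenize(aspirin)

# Bag-of-tokens counts, usable as input to any standard text classifier.
bag = Counter(tokens)
```

Here `tokens` plays the role of the words in a sentence, and `bag` is the simplest text feature that could be derived from it.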