AITopics | molecular fingerprint

One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra

Neo, Neng Kai Nigel, Jing, Lim, Preston, Ngoui Yong Zhau, Serene, Koh Xue Ting, Shen, Bingquan

arXiv.org Artificial Intelligence, Nov-4-2025

A common approach to the de novo molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt MIST (Goldman et al., 2023) as the encoder and MolForge (Ucak et al., 2023) as the decoder, leveraging additional training data to enhance performance. We also threshold the probabilities of each fingerprint bit to focus on the presence of substructures. This results in a tenfold improvement over previous state-of-the-art methods, generating top-1 31% / top-10 40% of molecular structures correctly from mass spectra in MassSpecGym (Bushuiev et al., 2024). We position this as a strong baseline for future research in de novo molecule elucidation from mass spectra.
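
The thresholding step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the encoder emits one probability per fingerprint bit; the 0.5 cutoff, the function names, and the sample values are hypothetical and are not taken from the paper.

```python
def threshold_fingerprint(probabilities, threshold=0.5):
    """Convert per-bit probabilities into a binary fingerprint.

    The 0.5 cutoff is an assumption for illustration only.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


def active_bits(fingerprint):
    """Return the indices of set bits (substructures marked as present)."""
    return [i for i, bit in enumerate(fingerprint) if bit == 1]


# Hypothetical encoder output for an 8-bit fingerprint.
predicted = [0.92, 0.10, 0.55, 0.49, 0.03, 0.71, 0.50, 0.25]
binary = threshold_fingerprint(predicted)
print(binary)
print(active_bits(binary))
```

After thresholding, a decoder sees only which substructures are predicted present, not how confident the encoder was about each one.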

artificial intelligence, fingerprint, machine learning, (16 more...)

2508.0418

Genre: Research Report (0.84)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation

Han, Yang, Wang, Pengyu, Yu, Kai, Chen, Xin, Chen, Lu

arXiv.org Artificial Intelligence, Oct-24-2025

Mass spectrometry (MS) plays a critical role in molecular identification, significantly advancing scientific discovery. However, structure elucidation from MS data remains challenging due to the scarcity of annotated spectra. While large-scale pretraining has proven effective in addressing data scarcity in other domains, applying this paradigm to mass spectrometry is hindered by the complexity and heterogeneity of raw spectral signals. To address this, we propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary, enabling cross-modal learning through large-scale pretraining on reliably computed fingerprint-molecule datasets. Multi-task pretraining objectives further enhance MS-BART's generalization by jointly optimizing denoising and translation tasks. The pretrained model is subsequently transferred to experimental spectra through finetuning on fingerprint predictions generated with MIST, a pre-trained spectral inference model, thereby enhancing robustness to real-world spectral variability. While finetuning alleviates the distributional difference, MS-BART still suffers from molecular hallucination and requires further alignment. We therefore introduce a chemical feedback mechanism that guides the model toward generating molecules closer to the reference structure. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1 and is faster by one order of magnitude than competing diffusion-based methods, while comprehensive ablation studies systematically validate the model's effectiveness and robustness.

artificial intelligence, machine learning, natural language, (18 more...)

2510.20615

Country: Asia > China (0.15)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

SoDaDE: Solvent Data-Driven Embeddings with Small Transformer Models

Gibberd, Gabriel Kitso, Folch, Jose Pablo, Chanona, Antonio Del Rio

arXiv.org Artificial Intelligence, Sep-29-2025

Computational representations have become crucial in unlocking the recent growth of machine learning algorithms for chemistry. Representations were initially hand-designed, but machine learning has shown that meaningful ones can be learnt from data. Chemical datasets are limited and so the representations learnt from data are generic, being trained on broad datasets which contain shallow information on many different molecule types. For example, generic fingerprints lack physical context specific to solvents. However, the use of harmful solvents is a leading climate-related issue in the chemical industry, and there is a surge of interest in green solvent replacement. To empower this research, we propose a new solvent representation scheme by developing Solvent Data-Driven Embeddings (SoDaDE). SoDaDE uses a small transformer model and a solvent property dataset to create a fingerprint for solvents. To showcase their effectiveness, we use SoDaDE to predict yields on a recently published dataset, outperforming previous representations. We demonstrate through this paper that data-driven fingerprints can be made with small datasets and set up a workflow that can be explored for other applications.

artificial intelligence, machine learning, representation, (14 more...)

2509.22302

Genre: Research Report (0.50)

Industry: Materials > Chemicals > Commodity Chemicals > Petrochemicals (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Towards Rational Pesticide Design with Graph Machine Learning Models for Ecotoxicology

Adamczyk, Jakub

arXiv.org Artificial Intelligence, Sep-24-2025

This research focuses on rational pesticide design, using graph machine learning to accelerate the development of safer, eco-friendly agrochemicals, inspired by in silico methods in drug discovery. With an emphasis on ecotoxicology, the initial contributions include the creation of ApisTox, the largest curated dataset on pesticide toxicity to honey bees. We conducted a broad evaluation of machine learning (ML) models for molecular graph classification, including molecular fingerprints, graph kernels, GNNs, and pretrained transformers. The results show that methods successful in medicinal chemistry often fail to generalize to agrochemicals, underscoring the need for domain-specific models and benchmarks. Future work will focus on developing a comprehensive benchmarking suite and designing ML models tailored to the unique challenges of pesticide discovery.

artificial intelligence, dataset, machine learning, (13 more...)

2509.18703

Country:

North America > United States (0.48)
Europe > Poland > Lesser Poland Province > Kraków (0.15)

Genre: Research Report > New Finding (0.49)

Industry:

Materials > Chemicals > Agricultural Chemicals (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Food & Agriculture > Agriculture > Pest Control (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)

Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning

Praski, Mateusz, Adamczyk, Jakub, Czech, Wojciech

arXiv.org Artificial Intelligence, Sep-16-2025

Pretrained neural networks have attracted significant interest in chemistry and small molecule drug design. Embeddings from these models are widely used for molecular property prediction, virtual screening, and small data learning in molecular chemistry. This study presents the most extensive comparison of such models to date, evaluating 25 models across 25 datasets. Under a fair comparison framework, we assess models spanning various modalities, architectures, and pretraining strategies. Using a dedicated hierarchical Bayesian statistical testing model, we arrive at a surprising result: nearly all neural models show negligible or no improvement over the baseline ECFP molecular fingerprint. Only the CLAMP model, which is also based on molecular fingerprints, performs statistically significantly better than the alternatives. These findings raise concerns about the evaluation rigor in existing studies. We discuss potential causes, propose solutions, and offer practical recommendations.

artificial intelligence, machine learning, natural language, (16 more...)

2508.06199

Country: Europe > Poland (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

High-Throughput Computational Screening and Interpretable Machine Learning of Metal-organic Frameworks for Iodine Capture

Tan, Haoyi, Teng, Yukun, Shan, Guangcun

arXiv.org Artificial Intelligence, Feb-14-2025

The removal of leaked radioactive iodine isotopes in humid environments holds significant importance in nuclear waste management and nuclear accident mitigation. In this study, high-throughput computational screening and machine learning were combined to reveal the iodine capture performance of 1816 metal-organic framework (MOF) materials under humid air conditions. Firstly, the relationship between the structural characteristics of MOF materials (including density, surface area and pore features) and their adsorption properties was explored, with the aim of identifying the optimal structural parameters for iodine capture. Subsequently, two machine learning regression algorithms, Random Forest and CatBoost, were employed to predict the iodine adsorption capabilities of MOF materials. In addition to 6 structural features, 25 molecular features (encompassing the types of metal and ligand atoms as well as bonding modes) and 8 chemical features (including heat of adsorption and Henry's coefficient) were incorporated to enhance the prediction accuracy of the machine learning algorithms. Feature importance was assessed to determine the relative influence of various features on iodine adsorption performance; the Henry's coefficient and heat of adsorption to iodine were found to be the two most crucial chemical factors. Furthermore, four types of molecular fingerprints were introduced to provide comprehensive and detailed structural information on MOF materials. The top 20 most significant MACCS molecular fingerprints were picked out, revealing that the presence of six-membered ring structures and nitrogen atoms in the MOF framework were the key structural factors that enhanced iodine adsorption, followed by the existence of oxygen atoms. This work combined high-throughput computation, machine learning, and molecular fingerprints to comprehensively and systematically elucidate the multifaceted factors influencing the iodine adsorption performance of MOFs in humid environments, offering insightful guidelines for screening and structural design of advanced MOF materials.

adsorption, descriptor, mof material, (12 more...)

2502.15764

Country:

North America > United States > Idaho (0.04)
Asia > China > Hong Kong (0.04)
Europe > Austria > Vienna (0.04)
(3 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Water & Waste Management (1.00)
Health & Medicine (1.00)
Energy > Renewable (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

Molecular Fingerprints Are Strong Models for Peptide Function Prediction

Adamczyk, Jakub, Ludynia, Piotr, Czech, Wojciech

arXiv.org Artificial Intelligence, Jan-29-2025

We study the effectiveness of molecular fingerprints for peptide property prediction and demonstrate that domain-specific feature extraction from molecular graphs can outperform complex and computationally expensive models such as GNNs, pretrained sequence-based transformers and multimodal ensembles, even without hyperparameter tuning. To this end, we perform a thorough evaluation on 126 datasets, achieving state-of-the-art results on LRGB and 5 other peptide function prediction benchmarks. We show that models based on count variants of ECFP, Topological Torsion, and RDKit molecular fingerprints and LightGBM as classification head are remarkably robust. The strong performance of molecular fingerprints, which are intrinsically very short-range feature encoders, challenges the presumed importance of long-range interactions in peptides. Our conclusion is that the use of molecular fingerprints for larger molecules, such as peptides, can be a computationally feasible, low-parameter, and versatile alternative to sophisticated deep learning models.
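
The difference between a binary fingerprint and the count variant mentioned in the abstract can be shown with a toy sketch. It assumes substructures have already been extracted and are given as plain strings; real ECFP, Topological Torsion, or RDKit fingerprints derive them from the molecular graph. The feature names, vector length, and helper function are hypothetical.

```python
import zlib


def fold_features(features, n_bits=16, count=True):
    """Hash substructure identifiers into a fixed-length vector.

    count=True  -> each position stores how often a feature landed there.
    count=False -> each position stores only presence (0 or 1).
    zlib.crc32 is used because it is deterministic across runs,
    unlike Python's built-in hash() for strings.
    """
    vector = [0] * n_bits
    for feature in features:
        index = zlib.crc32(feature.encode("utf-8")) % n_bits
        if count:
            vector[index] += 1
        else:
            vector[index] = 1
    return vector


# Hypothetical substructure list with a repeated feature, as would be
# typical for a peptide built from recurring units.
features = ["C-N", "C=O", "C-N", "C-C", "C-N"]
counts = fold_features(features, count=True)
binary = fold_features(features, count=False)
print(counts)
print(binary)
```

The count vector keeps the information that a substructure occurs several times, which the binary vector discards. Either vector could then be passed to a gradient-boosted classifier such as LightGBM.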

machine learning, natural language, peptide, (17 more...)

2501.17901

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Poland > Lesser Poland Province > Kraków (0.14)

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Human-level molecular optimization driven by mol-gene evolution

Fang, Jiebin, Mao, Churu, Zhu, Yuchen, Chen, Xiaoming, Hsieh, Chang-Yu, Ma, Zhongjun

arXiv.org Artificial Intelligence, Jun-12-2024

De novo molecule generation allows the search for more drug-like hits across a vast chemical space. However, lead optimization is still required, and the process of optimizing molecular structures faces the challenge of balancing structural novelty with pharmacological properties. This study introduces the Deep Genetic Molecular Modification Algorithm (DGMM), which brings structure modification to the level of medicinal chemists. A discrete variational autoencoder (D-VAE) is used in DGMM to encode molecules as quantization code, mol-gene, which incorporates deep learning into genetic algorithms for flexible structural optimization. The mol-gene allows for the discovery of pharmacologically similar but structurally distinct compounds, and reveals the trade-offs of structural optimization in drug discovery. We demonstrate the effectiveness of the DGMM in several applications.

algorithm, molecule, optimization, (15 more...)

2406.1291

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

A Python library for efficient computation of molecular fingerprints

Szafarczyk, Michał, Ludynia, Piotr, Kukla, Przemysław

arXiv.org Artificial Intelligence, Mar-27-2024

Machine learning solutions are very popular in the field of chemoinformatics, where they have numerous applications, such as novel drug discovery or molecular property prediction. Molecular fingerprints are algorithms commonly used for vectorizing chemical molecules as a part of preprocessing in this kind of solution. However, despite their popularity, there are no libraries that implement them efficiently for large datasets, utilizing modern, multicore architectures. On top of that, most of them do not provide the user with an intuitive interface, or one that would be compatible with other machine learning tools. In this project, we created a Python library that computes molecular fingerprints efficiently and delivers an interface that is comprehensive and enables the user to easily incorporate the library into their existing machine learning workflow. The library enables the user to perform computation on large datasets using parallelism. Because of that, it is possible to perform such tasks as hyperparameter tuning in a reasonable time. We describe the tools used in the implementation of the library and assess its time performance on example benchmark datasets. Additionally, we show that using molecular fingerprints we can achieve results comparable to state-of-the-art ML solutions even with very simple models.
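
The idea of computing fingerprints for a large dataset in parallel batches can be sketched as follows. This is not the interface of the library described in the abstract: the toy fingerprint function, the batch size, and all names are hypothetical. Threads are used only to keep the sketch self-contained; for CPU-bound work in CPython a process-based pool such as concurrent.futures.ProcessPoolExecutor is the usual choice.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_fingerprint(smiles, n_bits=8):
    """Stand-in fingerprint: set one bit per distinct character in the string."""
    bits = [0] * n_bits
    for ch in set(smiles):
        bits[ord(ch) % n_bits] = 1
    return bits


def compute_batch(batch):
    """Compute fingerprints for one batch of molecules."""
    return [toy_fingerprint(s) for s in batch]


def parallel_fingerprints(smiles_list, n_workers=2, batch_size=2):
    """Split the input into batches and process the batches concurrently."""
    batches = [
        smiles_list[i:i + batch_size]
        for i in range(0, len(smiles_list), batch_size)
    ]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map yields results in input order, so the output
        # stays aligned with the input molecules.
        results = list(pool.map(compute_batch, batches))
    return [fp for batch in results for fp in batch]


molecules = ["CCO", "c1ccccc1", "CC(=O)O", "CN", "O=C=O"]
fingerprints = parallel_fingerprints(molecules)
print(len(fingerprints))
```

Batching keeps the scheduling overhead per molecule small, which matters once the dataset is much larger than the number of workers.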

fingerprint, library, molecule, (14 more...)

2403.19718

Country:

Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
North America > United States > Massachusetts (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Education (0.86)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.30)
Health & Medicine > Therapeutic Area > Immunology (0.30)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Software Engineering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

When SMILES have Language: Drug Classification using Text Classification Methods on Drug SMILES Strings

Wasi, Azmine Toushik, Karlo, Šerbetar, Islam, Raima, Rafi, Taki Hasan, Chae, Dong-Kyu

arXiv.org Machine Learning, Mar-27-2024

Complex chemical structures, like drugs, are usually defined by SMILES strings as a sequence of atoms and bonds. These SMILES strings are used in different complex machine learning-based drug-related research and representation works. Escaping from complex representation, in this work, we pose a single question: What if we treat drug SMILES as conventional sentences and engage in text classification for drug classification? The study explores the notion of viewing each atom and bond as sentence components, employing basic NLP methods to categorize drug types, proving that complex problems can also be solved with simpler perspectives. Classifying drug types plays a pivotal role in drug discovery research, aiding in the categorization of established drugs and enhancing our understanding of the distinctive features of newly identified or synthesized drugs.
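
Treating a SMILES string as a sentence can be sketched by splitting it into atom and bond tokens and counting n-grams, as one would for ordinary text. The simplified token pattern, the choice of bigrams, and the function names are assumptions for illustration, not the authors' pipeline.

```python
import re
from collections import Counter

# Simplified pattern: bracket atoms first, then the two-letter halogens,
# then any remaining single character (atoms, bonds, ring digits, parentheses).
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into word-like tokens."""
    return TOKEN_PATTERN.findall(smiles)


def ngram_counts(tokens, n=2):
    """Count n-grams of tokens, the bag-of-words style input to a text classifier."""
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


tokens = tokenize_smiles("CC(=O)Cl")
print(tokens)
print(ngram_counts(tokens))
```

The resulting counts could be fed to any standard text classifier in place of word n-grams.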

classification, drug classification, representation, (14 more...)

2403.12984

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Bangladesh (0.05)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
