Goto

Collaborating Authors

 tanimoto similarity


TaPrGeMoHigh Ring CountHigh PolarizabilityHigh Drug-likenessHigh Hydrophobicityopernrlgeecertau ttlyeesd

Neural Information Processing Systems

Searching through chemical space is an exceptionally challenging problem because the number of possible molecules grows combinatorially with the number of atoms. Large, autoregressive models trained on databases of chemical compounds have yielded powerful generators, but we still lack robust strategies for generating molecules with desired properties. This molecular search problem closely resembles the "alignment" problem for large language models, though for many chemical tasks we have a specific and easily evaluable reward function. Here, we introduce an algorithm called energy rank alignment (ERA) that leverages an explicit reward function to produce a gradient-based objective that we use to optimize autoregressive policies. We show theoretically that this algorithm is closely related to proximal policy optimization (PPO) and direct preference optimization (DPO), but has a minimizer that converges to an ideal Gibbs-Boltzmann distribution with the reward playing the role of an energy function. Furthermore, this algorithm is highly scalable, does not require reinforcement learning, and performs well relative to DPO when the number of preference observations per pairing is small. We deploy this approach to align molecular transformers and protein language models to generate molecules and protein sequences, respectively, with externally specified properties and find that it does so robustly, searching through diverse parts of chemical space.


Pairwise Optimal Transports for Training All-to-All Flow-Based Condition Transfer Model

Neural Information Processing Systems

In this paper, we propose a flow-based method for learning all-to-all transfer maps among conditional distributions that approximates pairwise optimal transport. The proposed method addresses the challenge of handling the case of continuous conditions, which often involve a large set of conditions with sparse empirical observations per condition. We introduce a novel cost function that enables simultaneous learning of optimal transports for all pairs of conditional distributions. Our method is supported by a theoretical guarantee that, in the limit, it converges to the pairwise optimal transports among infinite pairs of conditional distributions. The learned transport maps are subsequently used to couple data points in conditional flow matching. We demonstrate the effectiveness of this method on synthetic and benchmark datasets, as well as on chemical datasets in which continuous physical properties are defined as conditions.


MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation

Neural Information Processing Systems

Mass spectrometry (MS) plays a critical role in molecular identification, significantly advancing scientific discovery. However, structure elucidation from MS data remains challenging due to the scarcity of annotated spectra. While largescale pretraining has proven effective in addressing data scarcity in other domains, applying this paradigm to mass spectrometry is hindered by the complexity and heterogeneity of raw spectral signals. To address this, we propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary, enabling cross-modal learning through large-scale pretraining on reliably computed fingerprint-molecule datasets. Multi-task pretraining objectives further enhance MS-BART's generalization by jointly optimizing denoising and translation task. The pretrained model is subsequently transferred to experimental spectra through finetuning on fingerprint predictions generated with MIST, a pre-trained spectral inference model, thereby enhancing robustness to real-world spectral variability. While finetuning alleviates the distributional difference, MS-BART still suffers molecular hallucination and requires further alignment. We therefore introduce a chemical feedback mechanism that guides the model toward generating molecules closer to the reference structure. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1 and is faster by one order of magnitude than competing diffusion-based methods, while comprehensive ablation studies systematically validate the model's effectiveness and robustness.



Hash Collisions in Molecular Fingerprints: Effects on Property Prediction and Bayesian Optimization

arXiv.org Artificial Intelligence

Molecular fingerprinting methods use hash functions to create fixed-length vector representations of molecules. However, hash collisions cause distinct substructures to be represented with the same feature, leading to overestimates in molecular similarity calculations. We investigate whether using exact fingerprints improves accuracy compared to standard compressed fingerprints in molecular property prediction and Bayesian optimization where the underlying predictive model is a Gaussian process. We find that using exact fingerprints yields a small yet consistent improvement in predictive accuracy on five molecular property prediction benchmarks from the DOCKSTRING dataset. However, these gains did not translate to significant improvements in Bayesian optimization performance.


Graph Diffusion Counterfactual Explanation

arXiv.org Artificial Intelligence

Machine learning models that operate on graph-structured data, such as molecular graphs or social networks, often make accurate predictions but offer little insight into why certain predictions are made. Counterfactual explanations address this challenge by seeking the closest alternative scenario where the model's prediction would change. Although counterfactual explanations are extensively studied in tabular data and computer vision, the graph domain remains comparatively underexplored. Constructing graph counterfactuals is intrinsically difficult because graphs are discrete and non-euclidean objects. We introduce Graph Diffusion Counterfactual Explanation, a novel framework for generating counterfactual explanations on graph data, combining discrete diffusion models and classifier-free guidance. We empirically demonstrate that our method reliably generates in-distribution as well as minimally structurally different counterfactuals for both discrete classification targets and continuous properties.


MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation

arXiv.org Artificial Intelligence

Mass spectrometry (MS) plays a critical role in molecular identification, significantly advancing scientific discovery. However, structure elucidation from MS data remains challenging due to the scarcity of annotated spectra. While large-scale pretraining has proven effective in addressing data scarcity in other domains, applying this paradigm to mass spectrometry is hindered by the complexity and heterogeneity of raw spectral signals. To address this, we propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary, enabling cross-modal learning through large-scale pretraining on reliably computed fingerprint-molecule datasets. Multi-task pretraining objectives further enhance MS-BART's generalization by jointly optimizing denoising and translation task. The pretrained model is subsequently transferred to experimental spectra through finetuning on fingerprint predictions generated with MIST, a pre-trained spectral inference model, thereby enhancing robustness to real-world spectral variability. While finetuning alleviates the distributional difference, MS-BART still suffers molecular hallucination and requires further alignment. We therefore introduce a chemical feedback mechanism that guides the model toward generating molecules closer to the reference structure. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1 and is faster by one order of magnitude than competing diffusion-based methods, while comprehensive ablation studies systematically validate the model's effectiveness and robustness.




Phenotypic Profile-Informed Generation of Drug-Like Molecules via Dual-Channel Variational Autoencoders

arXiv.org Artificial Intelligence

The de novo generation of drug-like molecules capable of inducing desirable phenotypic changes is receiving increasing attention. However, previous methods predominantly rely on expression profiles to guide molecule generation, but overlook the perturbative effect of the molecules on cellular contexts. To overcome this limitation, we propose SmilesGEN, a novel generative model based on variational autoencoder (VAE) architecture to generate molecules with potential therapeutic effects. SmilesGEN integrates a pre-trained drug VAE (SmilesNet) with an expression profile VAE (ProfileNet), jointly modeling the interplay between drug perturbations and transcriptional responses in a common latent space. Specifically, ProfileNet is imposed to reconstruct pre-treatment expression profiles when eliminating drug-induced perturbations in the latent space, while SmilesNet is informed by desired expression profiles to generate drug-like molecules. Our empirical experiments demonstrate that SmilesGEN outperforms current state-of-the-art models in generating molecules with higher degree of validity, uniqueness, novelty, as well as higher Tanimoto similarity to known ligands targeting the relevant proteins. Moreover, we evaluate SmilesGEN for scaffold-based molecule optimization and generation of therapeutic agents, and confirmed its superior performance in generating molecules with higher similarity to approved drugs. SmilesGEN establishes a robust framework that leverages gene signatures to generate drug-like molecules that hold promising potential to induce desirable cellular phenotypic changes.