digress
DMol: A Highly Efficient and Chemical Motif-Preserving Molecule Generation Platform
Niu, Peizhi, Wang, Yu-Hsiang, Rana, Vishal, Rupakheti, Chetan, Pandey, Abhishek, Milenkovic, Olgica
We introduce a new graph diffusion model for small molecule generation, DMol, which outperforms the state-of-the-art DiGress model in terms of validity by roughly 1.5% across all benchmarking datasets while reducing the number of diffusion steps by at least 10-fold, and the running time to roughly one half. The performance improvements are a result of a careful change in the objective function and a graph noise scheduling approach which, at each diffusion step, allows one to only change a subset of nodes of varying size in the molecule graph. Another relevant property of the method is that it can be easily combined with junction-tree-like graph representations that arise by compressing a collection of relevant ring structures into supernodes. Unlike classical junction-tree techniques that involve VAEs and require complicated reconstruction steps, compressed DMol directly performs graph diffusion on a graph that compresses only a carefully selected set of frequent carbon rings into supernodes, which results in straightforward sample generation. This compressed DMol method offers additional validity improvements over generic DMol of roughly 2%, increases the novelty of the method, and further improves the running time due to reductions in the graph size.
A collaborative constrained graph diffusion model for the generation of realistic synthetic molecules
Ruiz-Botella, Manuel, Sales-Pardo, Marta, Guimerà, Roger
Developing new molecular compounds is crucial to address pressing challenges, from health to environmental sustainability. However, exploring the molecular space to discover new molecules is difficult due to the vastness of the space. Here we introduce CoCoGraph, a collaborative and constrained graph diffusion model capable of generating molecules that are guaranteed to be chemically valid. Thanks to the constraints built into the model and to the collaborative mechanism, CoCoGraph outperforms state-of-the-art approaches on standard benchmarks while requiring up to an order of magnitude fewer parameters. Analysis of 36 chemical properties also demonstrates that CoCoGraph generates molecules with distributions more closely matching real molecules than current models. Leveraging the model's efficiency, we created a database of 8.2M million synthetically generated molecules and conducted a Turing-like test with organic chemistry experts to further assess the plausibility of the generated molecules, and potential biases and limitations of CoCoGraph.
- Research Report (1.00)
- Overview (1.00)
Backdoor Attacks on Discrete Graph Diffusion Models
Wang, Jiawen, Karim, Samin, Hong, Yuan, Wang, Binghui
Diffusion models are powerful generative models in continuous data domains such as image and video data. Discrete graph diffusion models (DGDMs) have recently extended them for graph generation, which are crucial in fields like molecule and protein modeling, and obtained the SOTA performance. However, it is risky to deploy DGDMs for safety-critical applications (e.g., drug discovery) without understanding their security vulnerabilities. In this work, we perform the first study on graph diffusion models against backdoor attacks, a severe attack that manipulates both the training and inference/generation phases in graph diffusion models. We first define the threat model, under which we design the attack such that the backdoored graph diffusion model can generate 1) high-quality graphs without backdoor activation, 2) effective, stealthy, and persistent backdoored graphs with backdoor activation, and 3) graphs that are permutation invariant and exchangeable--two core properties in graph generative models. 1) and 2) are validated via empirical evaluations without and with backdoor defenses, while 3) is validated via theoretical results.
- North America > United States > Illinois (0.04)
- North America > United States > Connecticut (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- (5 more...)
FragFM: Efficient Fragment-Based Molecular Generation via Discrete Flow Matching
Lee, Joongwon, Kim, Seonghwan, Kim, Wou Youn
A BSTRACT We introduce FragFM, a novel fragment-based discrete flow matching framework for molecular graph generation. FragFM generates molecules at the fragment level, leveraging a coarse-to-fine autoencoding mechanism to reconstruct atom-level details. This approach reduces computational complexity while maintaining high chemical validity, enabling more efficient and scalable molecular generation. Notably, FragFM achieves over 99% validity with significantly fewer sampling steps, improving scalability while preserving molecular diversity. These results highlight the potential of fragment-based generative modeling for large-scale, property-aware molecular design, paving the way for more efficient exploration of chemical space. 1 I NTRODUCTION Deep generative models, such as diffusion and flow matching, have demonstrated remarkable success across domains like images (Nichol et al., 2021; Rombach et al., 2022; Ho et al., 2020), text (Li et al., 2022), and videos (Hu & Xu, 2023; Ho et al., 2022). Recently, their application to molecular graph generation has gained attention, where they aim to generate chemically valid molecules by leveraging the structural properties of molecular graphs (Jo et al., 2022; Vignac et al., 2022; Qin et al., 2024). However, existing atom-based generative models face scalability challenges, particularly in generating large and complex molecules. The quadratic growth of edges as graph size increases results in computational inefficiencies. At the same time, the inherent sparsity of chemical bonds makes accurate edge prediction more complex, often leading to unrealistic molecular structures or invalid connectivity constraints (Qin et al., 2023; Chen et al., 2023). Additionally, graph neural networks (GNNs) struggle to capture topological features such as rings and loops, leading to deviations from chemically valid structures. While various methods incorporate auxiliary features (e.g., spectral, ring, and valency information) to mitigate these issues, they do not fully resolve the sparsity and scalability bottlenecks (Vignac et al., 2022).
Vertical Validation: Evaluating Implicit Generative Models for Graphs on Thin Support Regions
Elkady, Mai, Bui, Thu, Ribeiro, Bruno, Inouye, David I.
There has been a growing excitement that implicit graph generative models could be used to design or discover new molecules for medicine or material design. Because these molecules have not been discovered, they naturally lie in unexplored or scarcely supported regions of the distribution of known molecules. However, prior evaluation methods for implicit graph generative models have focused on validating statistics computed from the thick support (e.g., mean and variance of a graph property). Therefore, there is a mismatch between the goal of generating novel graphs and the evaluation methods. To address this evaluation gap, we design a novel evaluation method called Vertical Validation (VV) that systematically creates thin support regions during the train-test splitting procedure and then reweights generated samples so that they can be compared to the held-out test data. This procedure can be seen as a generalization of the standard train-test procedure except that the splits are dependent on sample features. We demonstrate that our method can be used to perform model selection if performance on thin support regions is the desired goal. As a side benefit, we also show that our approach can better detect overfitting as exemplified by memorization.
- North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
- North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Hungary > Hajdú-Bihar County > Debrecen (0.04)
Pard: Permutation-Invariant Autoregressive Diffusion for Graph Generation
Zhao, Lingxiao, Ding, Xueying, Akoglu, Leman
Graph generation has been dominated by autoregressive models due to their simplicity and effectiveness, despite their sensitivity to ordering. Yet diffusion models have garnered increasing attention, as they offer comparable performance while being permutation-invariant. Current graph diffusion models generate graphs in a one-shot fashion, but they require extra features and thousands of denoising steps to achieve optimal performance. We introduce PARD, a Permutation-invariant Auto Regressive Diffusion model that integrates diffusion models with autoregressive methods. PARD harnesses the effectiveness and efficiency of the autoregressive model while maintaining permutation invariance without ordering sensitivity. Specifically, we show that contrary to sets, elements in a graph are not entirely unordered and there is a unique partial order for nodes and edges. With this partial order, PARD generates a graph in a block-by-block, autoregressive fashion, where each block's probability is conditionally modeled by a shared diffusion model with an equivariant network. To ensure efficiency while being expressive, we further propose a higher-order graph transformer, which integrates transformer with PPGN. Like GPT, we extend the higher-order graph transformer to support parallel training of all blocks. Without any extra features, PARD achieves state-of-the-art performance on molecular and non-molecular datasets, and scales to large datasets like MOSES containing 1.9M molecules.
Classifier-free graph diffusion for molecular property targeting
Ninniri, Matteo, Podda, Marco, Bacciu, Davide
This work focuses on the task of property targeting: that is, generating molecules conditioned on target chemical properties to expedite candidate screening for novel drug and materials development. DiGress is a recent diffusion model for molecular graphs whose distinctive feature is allowing property targeting through classifier-based (CB) guidance. While CB guidance may work to generate molecular-like graphs, we hint at the fact that its assumptions apply poorly to the chemical domain. Based on this insight we propose a classifier-free DiGress (FreeGress), which works by directly injecting the conditioning information into the training process. CF guidance is convenient given its less stringent assumptions and since it does not require to train an auxiliary property regressor, thus halving the number of trainable parameters in the model. We empirically show that our model yields up to 79% improvement in Mean Absolute Error with respect to DiGress on property targeting tasks on QM9 and ZINC-250k benchmarks. As an additional contribution, we propose a simple yet powerful approach to improve chemical validity of generated samples, based on the observation that certain chemical properties such as molecular weight correlate with the number of atoms in molecules.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Italy > Tuscany > Pisa Province > Pisa (0.04)
Size Matters: Large Graph Generation with HiGGs
Davies, Alex O., Ajmeri, Nirav S., Filho, Telmo M. Silva
Large graphs are present in a variety of domains, including social networks, civil infrastructure, and the physical sciences to name a few. Graph generation is similarly widespread, with applications in drug discovery, network analysis and synthetic datasets among others. While GNN (Graph Neural Network) models have been applied in these domains their high in-memory costs restrict them to small graphs. Conversely less costly rule-based methods struggle to reproduce complex structures. We propose HIGGS (Hierarchical Generation of Graphs) as a model-agnostic framework of producing large graphs with realistic local structures. HIGGS uses GNN models with conditional generation capabilities to sample graphs in hierarchies of resolution. As a result HIGGS has the capacity to extend the scale of generated graphs from a given GNN model by quadratic order. As a demonstration we implement HIGGS using DiGress, a recent graph-diffusion model, including a novel edge-predictive-diffusion variant edge-DiGress. We use this implementation to generate categorically attributed graphs with tens of thousands of nodes. These HIGGS generated graphs are far larger than any previously produced using GNNs. Despite this jump in scale we demonstrate that the graphs produced by HIGGS are, on the local scale, more realistic than those from the rule-based model BTER.
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- (5 more...)
Discriminator Guidance for Autoregressive Diffusion Models
Kelvinius, Filip Ekström, Lindsten, Fredrik
We introduce discriminator guidance in the setting of Autoregressive Diffusion Models. The use of a discriminator to guide a diffusion process has previously been used for continuous diffusion models, and in this work we derive ways of using a discriminator together with a pretrained generative model in the discrete case. First, we show that using an optimal discriminator will correct the pretrained model and enable exact sampling from the underlying data distribution. Second, to account for the realistic scenario of using a sub-optimal discriminator, we derive a sequential Monte Carlo algorithm which iteratively takes the predictions from the discrimiator into account during the generation process. We test these approaches on the task of generating molecular graphs and show how the discriminator improves the generative performance over using only the pretrained model.
Tertiary Lymphoid Structures Generation through Graph-based Diffusion
Madeira, Manuel, Thanou, Dorina, Frossard, Pascal
Graph-based representation approaches have been proven to be successful in the analysis of biomedical data, due to their capability of capturing intricate dependencies between biological entities, such as the spatial organization of different cell types in a tumor tissue. However, to further enhance our understanding of the underlying governing biological mechanisms, it is important to accurately capture the actual distributions of such complex data. Graph-based deep generative models are specifically tailored to accomplish that. In this work, we leverage state-of-the-art graph-based diffusion models to generate biologically meaningful cell-graphs. In particular, we show that the adopted graph diffusion model is able to accurately learn the distribution of cells in terms of their tertiary lymphoid structures (TLS) content, a well-established biomarker for evaluating the cancer progression in oncology research. Additionally, we further illustrate the utility of the learned generative models for data augmentation in a TLS classification task. To the best of our knowledge, this is the first work that leverages the power of graph diffusion models in generating meaningful biological cell structures.
- Europe > Switzerland > Vaud > Lausanne (0.04)
- North America > Saint Martin (0.04)