Podda, Marco
Learning to quantify graph nodes
Micheli, Alessio, Moreo, Alejandro, Podda, Marco, Sebastiani, Fabrizio, Simoni, William, Tortorella, Domenico
Quantification (Esuli et al. 2023; González et al. 2017) is the machine learning task of estimating the prevalence (or proportions) of each class in a dataset. Unlike standard classification, which focuses on predicting a label for each individual example, quantification works at the aggregate level by estimating the overall fraction of unlabeled instances belonging to each class. Real-world applications of quantification include but are not limited to ecological modeling (González et al. 2017) (i.e., to characterize entire populations of living species) and market research (Sebastiani 2018) (i.e., for estimating market shares of different products or services). Quantification methods are explicitly designed to account for dataset shift, which occurs when the statistical properties of the training data differ from those of the test data, due to changes in input features, labels, or their relationships. Most quantification methods are tailored to one specific type of dataset shift, namely, prior probability shift (PPS), also referred to as "label shift" (Storkey 2009).
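To make the difference between classification and quantification concrete, the following is a minimal sketch of two textbook binary quantifiers, Classify and Count (CC) and Adjusted Classify and Count (ACC). It is not the graph-node method of this paper. The function names and the assumption that the classifier's true and false positive rates (`tpr`, `fpr`) were estimated on held-out training data are illustrative choices.

```python
from typing import Sequence


def classify_and_count(predictions: Sequence[int]) -> float:
    """CC: estimate positive-class prevalence as the fraction of
    items that a classifier labelled positive (1)."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    return sum(predictions) / len(predictions)


def adjusted_classify_and_count(
    predictions: Sequence[int], tpr: float, fpr: float
) -> float:
    """ACC: correct the CC estimate using the classifier's true and
    false positive rates, which are assumed to stay the same under
    prior probability shift.

    Under that assumption cc = tpr * p + fpr * (1 - p), which is
    solved here for the true prevalence p.
    """
    cc = classify_and_count(predictions)
    if tpr == fpr:
        # The classifier carries no information; fall back to CC.
        return cc
    adjusted = (cc - fpr) / (tpr - fpr)
    # Clip to a valid proportion.
    return min(1.0, max(0.0, adjusted))


# Hypothetical test-set predictions from some binary classifier.
test_predictions = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
cc_estimate = classify_and_count(test_predictions)
acc_estimate = adjusted_classify_and_count(test_predictions, tpr=0.8, fpr=0.2)
print(cc_estimate, acc_estimate)
```

CC simply counts predicted labels, so it inherits the classifier's bias when class proportions change between training and test data. ACC removes that bias only to the extent that the PPS assumption holds.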
Towards Efficient Molecular Property Optimization with Graph Energy Based Models
Miglior, Luca, Simone, Lorenzo, Podda, Marco, Bacciu, Davide
Optimizing chemical properties is a challenging task due to the vastness and complexity of chemical space. Here, we present a generative energy-based architecture for implicit chemical property optimization, designed to efficiently generate molecules that satisfy target properties without explicit conditional generation. We use Graph Energy Based Models and a training approach that does not require property labels.
Classifier-free graph diffusion for molecular property targeting
Ninniri, Matteo, Podda, Marco, Bacciu, Davide
This work focuses on the task of property targeting: that is, generating molecules conditioned on target chemical properties to expedite candidate screening for novel drug and materials development. DiGress is a recent diffusion model for molecular graphs whose distinctive feature is allowing property targeting through classifier-based (CB) guidance. While CB guidance may work to generate molecule-like graphs, we suggest that its assumptions apply poorly to the chemical domain. Based on this insight, we propose a classifier-free (CF) variant of DiGress, FreeGress, which works by directly injecting the conditioning information into the training process. CF guidance is convenient given its less stringent assumptions and because it does not require training an auxiliary property regressor, thus halving the number of trainable parameters in the model. We empirically show that our model yields up to a 79% improvement in Mean Absolute Error with respect to DiGress on property targeting tasks on the QM9 and ZINC-250k benchmarks. As an additional contribution, we propose a simple yet powerful approach to improve the chemical validity of generated samples, based on the observation that certain chemical properties, such as molecular weight, correlate with the number of atoms in molecules.
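As an illustration of the general idea behind classifier-free guidance, the sketch below combines a conditional and an unconditional categorical prediction in log space. This is a generic formulation for discrete distributions, not necessarily the exact rule used by FreeGress. The function name, the `scale` parameter, and the example probabilities are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cf_guided_distribution(
    p_cond: Sequence[float],
    p_uncond: Sequence[float],
    scale: float,
    eps: float = 1e-12,
) -> List[float]:
    """Return a categorical distribution proportional to
    p_uncond * (p_cond / p_uncond) ** scale.

    scale = 0 recovers the unconditional prediction, scale = 1 the
    conditional one, and scale > 1 pushes further towards the
    condition. Both inputs come from the same model, queried with and
    without the conditioning signal, so no separate property
    regressor is needed.
    """
    if len(p_cond) != len(p_uncond):
        raise ValueError("distributions must have the same length")
    logits = []
    for pc, pu in zip(p_cond, p_uncond):
        log_pc = math.log(max(pc, eps))  # clamp to avoid log(0)
        log_pu = math.log(max(pu, eps))
        logits.append(log_pu + scale * (log_pc - log_pu))
    # Numerically stable softmax to renormalise.
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical predictions over three atom types for one graph node.
p_cond = [0.7, 0.2, 0.1]    # model queried with the target property
p_uncond = [0.4, 0.4, 0.2]  # same model with the condition dropped
guided = cf_guided_distribution(p_cond, p_uncond, scale=2.0)
print(guided)
```

In a typical CF setup, the unconditional prediction is made available by randomly dropping the conditioning signal during training, so a single network serves both roles at sampling time.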
A Deep Generative Model for Fragment-Based Molecule Generation
Podda, Marco, Bacciu, Davide, Micheli, Alessio
Molecule generation is a challenging open problem in cheminformatics. Currently, deep generative approaches addressing the challenge belong to two broad categories, differing in how molecules are represented. One approach encodes molecular graphs as strings of text, and learns their corresponding character-based language model. Another, more expressive, approach operates directly on the molecular graph. In this work, we address two limitations of the former: generation of invalid and duplicate molecules. To improve validity rates, we develop a language model for small molecular substructures called fragments, loosely inspired by the well-known paradigm of Fragment-Based Drug Design. In other words, we generate molecules fragment by fragment, instead of atom by atom. To improve uniqueness rates, we present a frequency-based masking strategy that helps generate molecules with infrequent fragments. We show experimentally that our model largely outperforms other language model-based competitors, reaching state-of-the-art performance typical of graph-based approaches. Moreover, generated molecules display molecular properties similar to those in the training sample, even in the absence of explicit task-specific supervision.
A Fair Comparison of Graph Neural Networks for Graph Classification
Errica, Federico, Podda, Marco, Bacciu, Davide, Micheli, Alessio
Experimental reproducibility and replicability are critical topics in machine learning. Authors have often raised concerns about their lack in scientific publications, with the aim of improving the quality of the field. Recently, the graph representation learning field has attracted the attention of a wide research community, which resulted in a large stream of works. As such, several Graph Neural Network models have been developed to effectively tackle graph classification. However, experimental procedures often lack rigor and are hardly reproducible. Motivated by this, we provide an overview of common practices that should be avoided to fairly compare with the state of the art. To counter this troubling trend, we ran more than 47000 experiments in a controlled and uniform framework to re-evaluate five popular models across nine common benchmarks. Moreover, by comparing GNNs with structure-agnostic baselines, we provide convincing evidence that, on some datasets, structural information has not been exploited yet. We believe that this work can contribute to the development of the graph learning field by providing a much-needed grounding for rigorous evaluations of graph classification models.