Synthetic Data from Diffusion Models Improve Drug Discovery Prediction
Bing Hu, Ashish Saragadam, Anita Layton, Helen Chen
arXiv.org Artificial Intelligence
There is a growing trend towards leveraging artificial intelligence (AI) in every stage of drug development [1]. Drug development is an expensive process: it costs $2-3 billion and takes 13-15 years to bring a single drug to market. By enabling high-throughput screening (HTS) of ligand candidates, drug discovery AI aims to reduce these developmental costs by transforming how ligands are designed and tested [2]. Drug development AI has seen early successes in areas such as poly-pharmacy [3], drug re-purposing [4, 5], drug-target interaction [6], drug response prediction [7], and the search for new antibiotics [8]. Equally important to advances in AI for drug discovery are improvements in the public data available for training and testing these models [9, 10, 11]. Breakthroughs in AI-based drug discovery happen only when progress in developing and refining drug discovery data is matched by progress in applying advanced AI models to that data. Huang et al. [9] noted three key data challenges that hinder attracting ML scientists to therapeutics: (1) a lack of AI-ready datasets and standardized knowledge representations; (2) datasets scattered across bio-repositories without curation; and (3) a lack of data focused on rare diseases and novel drugs in development. We posit another data challenge that slows the advancement of drug discovery AI: datasets are often collected independently, with little overlap between them, creating data sparsity. Data sparsity poses difficulties for researchers seeking to answer research questions that require data values spanning multiple datasets.
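The data sparsity problem described above can be illustrated with a small sketch. The dataset names, ligand IDs, and values below are entirely hypothetical, not drawn from the paper: two assay datasets collected independently share only one ligand, so merging them yields a table that is mostly missing values.

```python
import pandas as pd

# Hypothetical example: two independently collected assay datasets
# that overlap on only a single ligand.
binding = pd.DataFrame({
    "ligand": ["L1", "L2", "L3"],
    "binding_affinity": [7.2, 6.8, 5.9],
})
toxicity = pd.DataFrame({
    "ligand": ["L3", "L4", "L5"],
    "toxicity_score": [0.1, 0.4, 0.9],
})

# An outer join keeps every ligand, but because the datasets barely
# overlap, most rows now contain missing values -- the data sparsity
# that complicates cross-dataset research questions.
merged = binding.merge(toxicity, on="ligand", how="outer")

# Fraction of measurement cells that are missing after the merge.
sparsity = merged[["binding_affinity", "toxicity_score"]].isna().mean().mean()
```

Here 4 of the 10 measurement cells in `merged` are missing (sparsity 0.4); in practice, a model that needs both binding and toxicity values would have only one complete ligand to train on.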
May-6-2024