Multi-view biomedical foundation models for molecule-target and property prediction
Suryanarayanan, Parthasarathy, Qiu, Yunguang, Sethi, Shreyans, Mahajan, Diwakar, Li, Hongyang, Yang, Yuxin, Eyigoz, Elif, Saenz, Aldo Guzman, Platt, Daniel E., Rumbell, Timothy H., Ng, Kenney, Dey, Sanjoy, Burch, Myson, Kwon, Bum Chul, Meyer, Pablo, Cheng, Feixiong, Hu, Jianying, Morrone, Joseph A.
Drug discovery is a complex, multi-stage process. Lead identification and lead optimization remain costly, with low success rates, and computational methods play an important role in accelerating these tasks [1-3]. The prediction of a broad range of chemical and biological properties of candidate molecules is an essential component of screening and assessing molecules, and data-driven machine learning approaches have long aided in this process [4-6]. Molecular representations form the basis of machine learning models [2, 7], facilitating algorithmic and scientific advances in the field. However, learning useful and generalizable latent representations is a hard problem due to limited amounts of labeled data, the wide range of downstream tasks, the vastness of chemical space, and the large heterogeneity of molecular structures. Learning latent representations with unsupervised techniques is vital for such models to scale. Large language models (LLMs) have revolutionized other fields [8], and similar sequence-based foundation models have shown promise in learning molecular representations and being trainable on many downstream property prediction tasks [9-11]. A key advantage is that the transformer-based architecture can learn in a self-supervised fashion to create a "pre-trained" molecular representation. The most direct application of LLM-like transformers is facilitated by a sequential, text-based representation (e.g.
arXiv.org Artificial Intelligence
Oct-25-2024
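
As a rough illustration of the self-supervised pre-training described in the abstract, the sketch below masks random tokens in SMILES strings and trains a small transformer encoder to recover them. It is a minimal sketch, not the authors' implementation: PyTorch, the toy corpus, the character-level vocabulary, the 15% masking rate, and the model dimensions are all illustrative assumptions.

# Minimal sketch (assumptions: PyTorch, character-level SMILES tokens, toy corpus;
# not the paper's code): masked-token pre-training of a small transformer encoder,
# i.e., the self-supervised recipe the abstract refers to.
import torch
import torch.nn as nn

SMILES = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]       # toy corpus
PAD, MASK = 0, 1
stoi = {ch: i + 2 for i, ch in enumerate(sorted({c for s in SMILES for c in s}))}
vocab_size = len(stoi) + 2

def encode(s, max_len=32):
    ids = [stoi[c] for c in s][:max_len]
    return ids + [PAD] * (max_len - len(ids))

class SmilesEncoder(nn.Module):
    def __init__(self, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model, padding_idx=PAD)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, vocab_size)            # masked-token classifier

    def forward(self, x):
        h = self.encoder(self.emb(x), src_key_padding_mask=(x == PAD))
        return self.head(h), h                                # logits, latent states

model = SmilesEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

for step in range(100):                                       # tiny pre-training loop
    batch = torch.tensor([encode(s) for s in SMILES])
    inputs, labels = batch.clone(), batch.clone()
    mask = (torch.rand(batch.shape) < 0.15) & (batch != PAD)
    if not mask.any():                                        # skip degenerate batches
        continue
    inputs[mask] = MASK                                       # corrupt ~15% of tokens
    labels[~mask] = -100                                      # score only masked positions
    logits, _ = model(inputs)
    loss = loss_fn(logits.reshape(-1, vocab_size), labels.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

After pre-training under this kind of objective, the encoder's latent states (pooled over the sequence) could serve as the molecular representation fed to downstream property-prediction heads.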