Lai, Houtim
Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model
Zhou, Peng, Wang, Jianmin, Li, Chunyan, Wang, Zixu, Liu, Yiping, Sun, Siqi, Lin, Jianxin, Wei, Leyi, Cai, Xibao, Lai, Houtim, Liu, Wei, Wang, Longyue, Zeng, Xiangxiang
While various models and computational tools have been proposed for structure and property analysis of molecules, generating molecules that conform to all desired structures and properties remains a challenge. Here, we introduce a multi-constraint molecular generation large language model, TSMMG, which, akin to a student, incorporates knowledge from various small models and tools, namely, the 'teachers'. To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers', enabling it to generate novel molecules that conform to the descriptions through various text prompts. We experimentally show that TSMMG remarkably performs in generating molecules meeting complex, natural language-described property requirements across two-, three-, and four-constraint tasks, with an average molecular validity of over 99% and success ratio of 82.58%, 68.03%, and 67.48%, respectively. The model also exhibits adaptability through zero-shot testing, creating molecules that satisfy combinations of properties that have not been encountered. It can comprehend text inputs with various language styles, extending beyond the confines of outlined prompts, as confirmed through empirical validation. Additionally, the knowledge distillation feature of TSMMG contributes to the continuous enhancement of small models, while the innovative approach to dataset construction effectively addresses the issues of data scarcity and quality, which positions TSMMG as a promising tool in the domains of drug discovery and materials science.
DrugAssist: A Large Language Model for Molecule Optimization
Ye, Geyan, Cai, Xibao, Lai, Houtim, Wang, Xing, Huang, Junhong, Wang, Longyue, Liu, Wei, Zeng, Xiangxiang
Recently, the impressive performance of large language models (LLMs) on a wide range of tasks has attracted an increasing number of attempts to apply LLMs in drug discovery. However, molecule optimization, a critical task in the drug discovery pipeline, is currently an area that has seen little involvement from LLMs. Most of existing approaches focus solely on capturing the underlying patterns in chemical structures provided by the data, without taking advantage of expert feedback. These non-interactive approaches overlook the fact that the drug discovery process is actually one that requires the integration of expert experience and iterative refinement. To address this gap, we propose DrugAssist, an interactive molecule optimization model which performs optimization through humanmachine dialogue by leveraging LLM's strong interactivity and generalizability. DrugAssist has achieved leading results in both single and multiple property optimization, simultaneously showcasing immense potential in transferability and iterative optimization. In addition, we publicly release a large instructionbased dataset called "MolOpt-Instructions" for fine-tuning language models on molecule optimization tasks. Figure 1: The illustration of our proposed DrugAssist model framework, which focus on optimizing molecules through human-machine dialogue. Recently, generative artificial intelligence has made remarkable strides in the field of natural language processing (NLP), particularly with the advent of Large Language Models (LLMs) such as GPT (Generative Pre-trained Transformer) (Radford et al., 2019). These models have demonstrated impressive capabilities in a wide range of tasks, extending far beyond everyday communication and question-answering scenarios.
DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations
Ji, Yuanfeng, Zhang, Lu, Wu, Jiaxiang, Wu, Bingzhe, Huang, Long-Kai, Xu, Tingyang, Rong, Yu, Li, Lanqing, Ren, Jie, Xue, Ding, Lai, Houtim, Xu, Shaoyong, Feng, Jing, Liu, Wei, Luo, Ping, Zhou, Shuigeng, Huang, Junzhou, Zhao, Peilin, Bian, Yatao
AI-aided drug discovery (AIDD) is gaining increasing popularity due to its promise of making the search for new pharmaceuticals quicker, cheaper and more efficient. In spite of its extensive use in many fields, such as ADMET prediction, virtual screening, protein folding and generative chemistry, little has been explored in terms of the out-of-distribution (OOD) learning problem with \emph{noise}, which is inevitable in real world AIDD applications. In this work, we present DrugOOD, a systematic OOD dataset curator and benchmark for AI-aided drug discovery, which comes with an open-source Python package that fully automates the data curation and OOD benchmarking processes. We focus on one of the most crucial problems in AIDD: drug target binding affinity prediction, which involves both macromolecule (protein target) and small-molecule (drug compound). In contrast to only providing fixed datasets, DrugOOD offers automated dataset curator with user-friendly customization scripts, rich domain annotations aligned with biochemistry knowledge, realistic noise annotations and rigorous benchmarking of state-of-the-art OOD algorithms. Since the molecular data is often modeled as irregular graphs using graph neural network (GNN) backbones, DrugOOD also serves as a valuable testbed for \emph{graph OOD learning} problems. Extensive empirical studies have shown a significant performance gap between in-distribution and out-of-distribution experiments, which highlights the need to develop better schemes that can allow for OOD generalization under noise for AIDD.