Scientific Large Language Models: A Survey on Biological & Chemical Domains
Zhang, Qiang, Ding, Keyang, Lyv, Tianwen, Wang, Xinda, Yin, Qingyu, Zhang, Yiwen, Yu, Jing, Wang, Yuhao, Li, Xiaotong, Xiang, Zhuoyi, Zhuang, Xiang, Wang, Zeyuan, Qin, Ming, Zhang, Mengyao, Zhang, Jinlu, Cui, Jiyu, Xu, Renjun, Chen, Hongyang, Fan, Xiaohui, Xing, Huabin, Chen, Huajun
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered for facilitating scientific discovery. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of "scientific language", whilst providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions along with the advances of LLMs. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.
Empirical Evidence for the Fragment level Understanding on Drug Molecular Structure of LLMs
Hu, Xiuyuan, Liu, Guoqing, Zhao, Yang, Zhang, Hao
AI for drug discovery has been a research hotspot in recent years, and SMILES-based language models have been increasingly applied to drug molecular design. However, no prior work has explored whether and how language models understand chemical spatial structure from 1D sequences. In this work, we pre-train a transformer model on chemical language, fine-tune it toward drug design objectives, and investigate the correspondence between high-frequency SMILES substrings and molecular fragments. The results indicate that language models can understand chemical structures from the perspective of molecular fragments, and that the structural knowledge learned through fine-tuning is reflected in the high-frequency SMILES substrings generated by the model.
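The fragment analysis described above hinges on counting which SMILES substrings recur across a set of generated molecules. As a minimal sketch of that idea (not the authors' actual pipeline), the following counts character-level n-grams over a toy molecule list; `smiles_ngrams` and the example molecules are illustrative assumptions:

```python
from collections import Counter

def smiles_ngrams(smiles_list, n=2):
    """Count character-level n-gram substrings across a list of SMILES strings."""
    counts = Counter()
    for s in smiles_list:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts

# Toy molecules: ethanol, acetic acid, benzene
mols = ["CCO", "CC(=O)O", "c1ccccc1"]
# Aromatic "cc" dominates here, driven by the benzene ring
top = smiles_ngrams(mols, n=2).most_common(3)
```

In a real study the high-frequency substrings would then be mapped back onto molecular fragments (e.g., with a cheminformatics toolkit such as RDKit) to check whether they correspond to chemically meaningful substructures.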
Interactive Molecular Discovery with Natural Language
Zeng, Zheni, Yin, Bangchen, Wang, Shipeng, Liu, Jiarui, Yang, Cheng, Yao, Haishen, Sun, Xingzhi, Sun, Maosong, Xie, Guotong, Liu, Zhiyuan
Natural language is expected to be a key medium for various human-machine interactions in the era of large language models. In the biochemistry field, a series of tasks around molecules (e.g., property prediction, molecule mining) are of great significance while having a high technical threshold. Bridging molecule expressions in natural language and chemical language can not only greatly improve the interpretability of these tasks and lower their barrier to use, but also fuse the chemical knowledge scattered across complementary materials for a deeper comprehension of molecules. Based on these benefits, we propose conversational molecular design, a novel task adopting natural language for describing and editing target molecules. To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model, enhanced by injecting experimental property information, molecular spatial knowledge, and the associations between natural and chemical languages into it. Several typical solutions, including large language models (e.g., ChatGPT), are evaluated, proving the challenge of conversational molecular design and the effectiveness of our knowledge enhancement method. Case observations and analysis are conducted to provide directions for further exploration of natural-language interaction in molecular discovery.
PySMILESUtils – Enabling deep learning with the SMILES chemical language
Recent years have seen large interest in using the Simplified Molecular Input Line Entry System (SMILES) chemical language as input for deep learning architectures solving chemical tasks. Many successful applications have been demonstrated, for example in de novo molecular design, quantitative structure-activity relationship modelling, forward reaction prediction, and single-step retrosynthetic planning. PySMILESUtils aims to enable these tasks by providing ready-to-use and adaptable Python classes for tokenization, augmentation, and dataset and dataloader creation. Classes for handling datasets larger than memory and for speeding up training by minimizing padding are also provided. The framework subclasses PyTorch's Dataset and DataLoader but should be adaptable to other deep learning frameworks.
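Tokenization of the kind such toolkits provide can be sketched with a short regex-based tokenizer that splits a SMILES string into atom and bond tokens. This is an illustrative standalone sketch, not the PySMILESUtils API; the pattern and function names are assumptions:

```python
import re

# Illustrative SMILES token pattern (NOT from PySMILESUtils): bracket atoms,
# two-letter elements, ring-closure labels, organic-subset atoms, and bond/branch symbols.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|%\d{2}|[BCNOPSFIbcnops]|[=#\-\+\(\)\\/@\.\d])"
)

def tokenize(smiles):
    """Split a SMILES string into tokens; raise if any character is dropped."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not fully tokenize: {smiles!r}")
    return tokens
```

Note the ordering inside the alternation: multi-character tokens such as `Cl` and `Br` must be tried before single-letter atoms, or `ClC` would wrongly split as `C`, `l`, `C`. The round-trip check guards against silently dropped characters, a common failure mode of ad-hoc SMILES regexes.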