Sun, Siqi
Strategic priorities for transformative progress in advancing biology with proteomics and artificial intelligence
Sun, Yingying, A, Jun, Liu, Zhiwei, Sun, Rui, Qian, Liujia, Payne, Samuel H., Bittremieux, Wout, Ralser, Markus, Li, Chen, Chen, Yi, Dong, Zhen, Perez-Riverol, Yasset, Khan, Asif, Sander, Chris, Aebersold, Ruedi, Vizcaíno, Juan Antonio, Krieger, Jonathan R, Yao, Jianhua, Wen, Han, Zhang, Linfeng, Zhu, Yunping, Xuan, Yue, Sun, Benjamin Boyang, Qiao, Liang, Hermjakob, Henning, Tang, Haixu, Gao, Huanhuan, Deng, Yamin, Zhong, Qing, Chang, Cheng, Bandeira, Nuno, Li, Ming, E, Weinan, Sun, Siqi, Yang, Yuedong, Omenn, Gilbert S., Zhang, Yue, Xu, Ping, Fu, Yan, Liu, Xiaowen, Overall, Christopher M., Wang, Yu, Deutsch, Eric W., Chen, Luonan, Cox, Jürgen, Demichev, Vadim, He, Fuchu, Huang, Jiaxing, Jin, Huilin, Liu, Chao, Li, Nan, Luan, Zhongzhi, Song, Jiangning, Yu, Kaicheng, Wan, Wanggen, Wang, Tai, Zhang, Kang, Zhang, Le, Bell, Peter A., Mann, Matthias, Zhang, Bing, Guo, Tiannan
Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in the quality, diversity, and scale of mass spectrometry (MS)-based proteomics data, combined with groundbreaking AI techniques, are creating new challenges and opportunities for biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.
Accurate RNA 3D structure prediction using a language model-based deep learning approach
Shen, Tao, Hu, Zhihang, Sun, Siqi, Liu, Di, Wong, Felix, Wang, Jiuming, Chen, Jiayang, Wang, Yixuan, Hong, Liang, Xiao, Jin, Zheng, Liangzhen, Krishnamoorthi, Tejas, King, Irwin, Wang, Sheng, Yin, Peng, Collins, James J., Li, Yu
Accurate prediction of RNA three-dimensional (3D) structure remains an unsolved challenge. Determining RNA 3D structures is crucial for understanding their functions and informing RNA-targeting drug development and synthetic biology design. The structural flexibility of RNA, which leads to scarcity of experimentally determined data, complicates computational prediction efforts. Here, we present RhoFold+, an RNA language model-based deep learning method that accurately predicts 3D structures of single-chain RNAs from sequences. By integrating an RNA language model pre-trained on ~23.7 million RNA sequences and leveraging techniques to address data scarcity, RhoFold+ offers a fully automated end-to-end pipeline for RNA 3D structure prediction. Retrospective evaluations on RNA-Puzzles and CASP15 natural RNA targets demonstrate RhoFold+'s superiority over existing methods, including human expert groups. Its efficacy and generalizability are further validated through cross-family and cross-type assessments, as well as time-censored benchmarks. Additionally, RhoFold+ predicts RNA secondary structures and inter-helical angles, providing empirically verifiable features that broaden its applicability to RNA structure and function studies.
COMET: Benchmark for Comprehensive Biological Multi-omics Evaluation Tasks and Language Models
Ren, Yuchen, Han, Wenwei, Zhang, Qianyuan, Tang, Yining, Bai, Weiqiang, Cai, Yuchen, Qiao, Lifeng, Jiang, Hao, Yuan, Dong, Chen, Tao, Sun, Siqi, Tan, Pan, Ouyang, Wanli, Dong, Nanqing, Ma, Xinzhu, Ye, Peng
As key elements within the central dogma, DNA, RNA, and proteins play crucial roles in maintaining life by guaranteeing accurate genetic expression and implementation. Although research on these molecules has profoundly impacted fields like medicine, agriculture, and industry, the diversity of machine learning approaches, from traditional statistical methods to deep learning models and large language models, poses challenges for researchers in choosing the most suitable models for specific tasks, especially for cross-omics and multi-omics tasks, due to the lack of comprehensive benchmarks. To address this, we introduce COMET (Benchmark for Biological COmprehensive Multi-omics Evaluation Tasks and Language Models), the first comprehensive multi-omics benchmark, designed to evaluate models across single-omics, cross-omics, and multi-omics tasks. First, we curate and develop a diverse collection of downstream tasks and datasets covering key structural and functional aspects of DNA, RNA, and proteins, including tasks that span multiple omics levels. Then, we evaluate existing foundation language models for DNA, RNA, and proteins, as well as the newly proposed multi-omics method, offering valuable insights into their performance in integrating and analyzing data from different biological modalities. This benchmark aims to define critical issues in multi-omics research and guide future directions, ultimately promoting advancements in understanding biological processes through integrated analysis of diverse omics data.
Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model
Zhou, Peng, Wang, Jianmin, Li, Chunyan, Wang, Zixu, Liu, Yiping, Sun, Siqi, Lin, Jianxin, Wei, Leyi, Cai, Xibao, Lai, Houtim, Liu, Wei, Wang, Longyue, Zeng, Xiangxiang
While various models and computational tools have been proposed for structure and property analysis of molecules, generating molecules that conform to all desired structures and properties remains a challenge. Here, we introduce a multi-constraint molecular generation large language model, TSMMG, which, akin to a student, incorporates knowledge from various small models and tools, namely, the 'teachers'. To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers', enabling it to generate novel molecules that conform to the descriptions through various text prompts. We experimentally show that TSMMG performs remarkably well in generating molecules that meet complex, natural-language-described property requirements across two-, three-, and four-constraint tasks, with an average molecular validity of over 99% and success ratios of 82.58%, 68.03%, and 67.48%, respectively. The model also exhibits adaptability through zero-shot testing, creating molecules that satisfy combinations of properties it has not previously encountered. It can comprehend text inputs with various language styles, extending beyond the confines of the outlined prompts, as confirmed through empirical validation. Additionally, the knowledge distillation feature of TSMMG contributes to the continuous enhancement of small models, while the innovative approach to dataset construction effectively addresses the issues of data scarcity and quality, positioning TSMMG as a promising tool in the domains of drug discovery and materials science.
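The teacher-student data construction described above can be sketched in miniature: small "teacher" functions annotate molecules, and their outputs become text prompts paired with the molecule. The teachers and property checks below are hypothetical stand-ins for illustration, not the paper's actual tools.

```python
# Toy sketch of teacher-student text-molecule pair construction.
# `teacher_has_ring` and `teacher_is_small` are invented stand-ins for the
# structure-analysis and property-prediction "teachers"; they are assumptions,
# not TSMMG's real tool set.

def teacher_has_ring(smiles):
    # crude check: SMILES ring-closure bonds are written with digits
    return any(c.isdigit() for c in smiles)

def teacher_is_small(smiles):
    # placeholder size criterion based on string length
    return len(smiles) < 20

def build_pairs(smiles_list):
    """Turn teacher annotations into (text prompt, molecule) training pairs."""
    pairs = []
    for smi in smiles_list:
        props = [
            "contains a ring" if teacher_has_ring(smi) else "is acyclic",
            "is small" if teacher_is_small(smi) else "is large",
        ]
        prompt = "Generate a molecule that " + " and ".join(props) + "."
        pairs.append((prompt, smi))  # text -> molecule supervision
    return pairs

pairs = build_pairs(["c1ccccc1O", "CCO"])
```

A student language model trained on such pairs learns to map multi-constraint descriptions back to molecules, which is the direction TSMMG is trained in.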
Safeguarding Large Language Models: A Survey
Dong, Yi, Mu, Ronghui, Zhang, Yanghao, Sun, Siqi, Zhang, Tianle, Wu, Changshun, Jin, Gaojie, Qi, Yi, Hu, Jinwei, Meng, Jie, Bensalem, Saddek, Huang, Xiaowei
In the burgeoning field of Large Language Models (LLMs), developing a robust safety mechanism, colloquially known as "safeguards" or "guardrails", has become imperative to ensure the ethical use of LLMs within prescribed boundaries. This article provides a systematic literature review on the current status of this critical mechanism. It discusses its major challenges and how it can be enhanced into a comprehensive mechanism dealing with ethical issues in various contexts. First, the paper elucidates the current landscape of safeguarding mechanisms that major LLM service providers and the open-source community employ. This is followed by the techniques to evaluate, analyze, and enhance the (un)desirable properties that a guardrail might want to enforce, such as hallucination, fairness, and privacy. Building on these, we review techniques to circumvent these controls (i.e., attacks), to defend against such attacks, and to reinforce the guardrails. While the techniques above represent the current status and active research trends, we also discuss several challenges that these methods cannot easily address, and present our vision of how a comprehensive guardrail could be implemented through full consideration of multi-disciplinary approaches, neural-symbolic methods, and the systems development lifecycle.
Design, Actuation, and Functionalization of Untethered Soft Magnetic Robots with Life-Like Motions: A Review
Miao, Jiaqi, Sun, Siqi
Soft robots have demonstrated greater flexibility and functionality than conventional rigid robots. These versatile devices can respond to a wide range of external stimuli (including light, magnetic field, heat, and electric field) and can perform sophisticated tasks. Notably, soft magnetic robots exhibit unparalleled advantages over many other soft robots (such as untethered control, rapid response, and high safety), and have made remarkable progress in small-scale manipulation tasks and biomedical applications. Despite this promising potential, soft magnetic robots are still in their infancy and require significant advancements in fabrication, design principles, and functional development to be viable for real-world applications. Recent progress shows that bionics can serve as an effective tool for developing soft robots. In light of this, the review is presented with two main goals: (i) exploring how innovative bioinspired strategies can revolutionize the design and actuation of soft magnetic robots to realize various life-like motions; and (ii) examining how these bionic systems could benefit practical applications in small-scale solid/liquid manipulation and therapeutic/diagnostic-related biomedical fields.
ContraNovo: A Contrastive Learning Approach to Enhance De Novo Peptide Sequencing
Jin, Zhi, Xu, Sheng, Zhang, Xiang, Ling, Tianze, Dong, Nanqing, Ouyang, Wanli, Gao, Zhiqiang, Chang, Cheng, Sun, Siqi
De novo peptide sequencing from mass spectrometry (MS) data is a critical task in proteomics research. Traditional de novo algorithms have encountered a bottleneck in accuracy due to the inherent complexity of proteomics data. While deep learning-based methods have shown progress, they reduce the problem to a translation task, potentially overlooking critical nuances between spectra and peptides. In our research, we present ContraNovo, a pioneering algorithm that leverages contrastive learning to extract the relationship between spectra and peptides and incorporates the mass information into peptide decoding, aiming to address these intricacies more efficiently. Through rigorous evaluations on two benchmark datasets, ContraNovo consistently outshines contemporary state-of-the-art solutions, underscoring its promising potential in enhancing de novo peptide sequencing.
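The core contrastive idea described above, pulling embeddings of paired spectra and peptides together while pushing mismatched pairs apart, can be sketched with a symmetric InfoNCE-style objective. The shapes, embedding dimension, and temperature below are illustrative assumptions, not ContraNovo's implementation.

```python
import numpy as np

# Minimal sketch of a symmetric contrastive (InfoNCE-style) loss between
# spectrum embeddings and peptide embeddings. Matched pairs share a row index;
# all hyperparameters here are assumptions for illustration.

def contrastive_loss(spec_emb, pep_emb, temperature=0.07):
    """spec_emb, pep_emb: (batch, dim) L2-normalized embeddings."""
    logits = spec_emb @ pep_emb.T / temperature   # (batch, batch) similarities
    labels = np.arange(len(logits))               # positives on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()       # mean diagonal log-likelihood

    # cross-entropy in both directions: spectrum->peptide and peptide->spectrum
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# aligned pairs (identical embeddings) score far better than shuffled pairs
loss_aligned = contrastive_loss(emb, emb)
loss_random = contrastive_loss(emb, emb[::-1])
```

Minimizing such a loss encourages a spectrum's embedding to be closest to its own peptide's embedding, the relationship ContraNovo then exploits during decoding.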
Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation
Zhang, Le, Chen, Jiayang, Shen, Tao, Li, Yu, Sun, Siqi
The field of protein folding research has been greatly advanced by deep learning methods, with AlphaFold2 (AF2) demonstrating exceptional performance and atomic-level precision. As co-evolution is integral to protein structure prediction, AF2's accuracy is significantly influenced by the depth of the multiple sequence alignment (MSA), which requires extensive exploration of a large protein database for similar sequences. However, not all protein sequences possess abundant homologous families, and consequently, AF2's performance can degrade on such queries, at times failing to produce meaningful results. To address this, we introduce a novel generative language model, MSA-Augmenter, which leverages protein-specific attention mechanisms and large-scale MSAs to generate useful, novel protein sequences not currently found in databases. These sequences supplement shallow MSAs, enhancing the accuracy of structural property predictions. Our experiments on CASP14 demonstrate that MSA-Augmenter can generate de novo sequences that retain co-evolutionary information from inferior MSAs, thereby improving protein structure prediction quality on top of a strong AF2 baseline.
AF2-Mutation: Adversarial Sequence Mutations against AlphaFold2 on Protein Tertiary Structure Prediction
Yuan, Zhongju, Shen, Tao, Xu, Sheng, Yu, Leiye, Ren, Ruobing, Sun, Siqi
Deep learning-based approaches, such as AlphaFold2 (AF2), have significantly advanced protein tertiary structure prediction, achieving results comparable to experimental methods. While AF2 has shown limitations in predicting the effects of mutations, its robustness against sequence mutations remains to be determined. Starting with the wild-type (WT) sequence, we investigate adversarial sequences generated via an evolutionary approach, which AF2 predicts to be substantially different from WT. Our experiments on CASP14 reveal that by modifying merely three residues in the protein sequence using a combination of replacement, deletion, and insertion strategies, the alteration in AF2's predictions, as measured by the Local Distance Difference Test (lDDT), reaches 46.61. Moreover, when applied to a specific protein, SPNS2, our proposed algorithm successfully identifies biologically meaningful residues critical to protein structure determination and potentially indicates alternative conformations, thus significantly expediting the experimental process.
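The evolutionary adversarial search described above can be sketched as a propose-and-select loop: sample small sequence edits and keep those that maximize the change in the predicted structure. Running AlphaFold2 is out of scope for a sketch, so `structure_change` below is an invented stand-in scoring function, not lDDT computed from real predictions; the population and round sizes are likewise assumptions.

```python
import random

# Toy sketch of an evolutionary adversarial search over sequence mutations.
# Only the replacement strategy is shown (the paper also uses deletion and
# insertion); `structure_change` is a placeholder, not AF2 + lDDT.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def structure_change(seq, wild_type):
    # placeholder score: number of positions differing from the wild type
    return sum(a != b for a, b in zip(seq, wild_type))

def adversarial_search(wild_type, n_mutations=3, pop=20, rounds=10, seed=0):
    rng = random.Random(seed)
    best, best_score = wild_type, 0.0
    for _ in range(rounds):
        for _ in range(pop):
            cand = list(wild_type)
            for _ in range(n_mutations):          # propose up to 3 replacements
                i = rng.randrange(len(cand))
                cand[i] = rng.choice(AMINO_ACIDS)
            cand = "".join(cand)
            score = structure_change(cand, wild_type)
            if score > best_score:                # keep the most disruptive edit
                best, best_score = cand, score
        # a full evolutionary variant would seed the next round from `best`
    return best, best_score
```

Swapping the placeholder for an actual structure predictor and a real lDDT comparison turns this loop into the kind of search the abstract describes.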
Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention
Xu, Yichong, Zhu, Chenguang, Wang, Shuohang, Sun, Siqi, Cheng, Hao, Liu, Xiaodong, Gao, Jianfeng, He, Pengcheng, Zeng, Michael, Huang, Xuedong
Most of today's AI systems focus on using self-attention mechanisms and transformer architectures on large amounts of diverse data to achieve impressive performance gains. In this paper, we propose to augment the transformer architecture with an external attention mechanism to bring external knowledge and context to bear. By integrating external information into the prediction process, we hope to reduce the need for ever-larger models and increase the democratization of AI systems. We find that the proposed external attention mechanism can significantly improve the performance of existing AI systems, allowing practitioners to easily customize foundation AI models to many diverse downstream applications. In particular, we focus on the task of Commonsense Reasoning, demonstrating that the proposed external attention mechanism can augment existing transformer models and significantly improve the model's reasoning capabilities. The proposed system, Knowledgeable External Attention for commonsense Reasoning (KEAR), reaches human parity on the open CommonsenseQA research benchmark with an accuracy of 89.4% in comparison to the human accuracy of 88.9%.
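The external attention idea described above, letting the model attend over retrieved knowledge alongside the question itself, can be sketched as self-attention over a concatenated token sequence. The toy encoder, random embeddings, and identity projections below are illustrative assumptions, not KEAR's architecture.

```python
import numpy as np

# Minimal sketch of "external attention": question tokens attend over both
# themselves and embedded retrieved-knowledge tokens, so outputs mix in
# external context. Projections are identity matrices for brevity.

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    """Single-head scaled dot-product self-attention, identity Q/K/V."""
    d = X.shape[-1]
    A = softmax(X @ X.T / np.sqrt(d))   # (n, n) attention weights
    return A @ X

rng = np.random.default_rng(0)
question = rng.normal(size=(5, 8))             # embedded question tokens
knowledge = rng.normal(size=(7, 8))            # embedded retrieved knowledge
combined = np.vstack([question, knowledge])    # concatenate before attending

# question representations now also aggregate the external knowledge tokens
out = self_attention(combined)[: len(question)]
```

Because the extra context enters through the input sequence rather than new parameters, the same mechanism can be bolted onto an existing transformer, which is the practical appeal the abstract highlights.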