dnaGrinder: a lightweight and high-capacity genomic foundation model
Qihang Zhao, Chi Zhang, Weixiong Zhang
Foundation models (also known as large language models) such as BERT [1] and GPT [2] have demonstrated stellar performance in learning the complex characteristics and structures of natural languages, making them well suited for a variety of downstream applications such as sentiment analysis, text generation, and translation [3]. These foundation models have recently been adapted to analyze biological sequences, as their deep architectures and large-scale parameters are well suited to the intricacy of biological sequences and structures [4, 5, 6, 7, 8, 9, 10, 11]. Biological sequences, composed of nucleotides in DNA and RNA or of amino acids in peptides and proteins, are regarded as natural languages of life, and foundation models can be leveraged to uncover the underlying patterns and functions they encode [12]. Typically, these foundation models build robust feature representations from biological sequences through a process known as pretraining. Encoder-based models like BERT perform such pretraining with Masked Language Modeling (MLM), in which the model predicts the original tokens at masked or corrupted positions in the input sequences. By pretraining on millions of biological sequences, foundation models gain a comprehensive contextual understanding of those sequences. Once pretrained, they need only a few fine-tuning steps to be effectively applied to specific downstream tasks [13], including prediction of epigenetic marks, gene expression, protein folding structures, and more.
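To make the MLM objective mentioned above concrete, the following is a minimal sketch of the standard BERT-style masking step applied to a toy nucleotide-level tokenization. The VOCAB mapping, the mask_for_mlm helper, and single-base tokens are illustrative assumptions for this example only, not dnaGrinder's actual tokenizer or training code; real genomic models typically use k-mer or BPE tokenizers, but the corruption logic is the same.

import random

# Toy nucleotide-level vocabulary (illustrative assumption, not dnaGrinder's tokenizer).
VOCAB = {"[PAD]": 0, "[MASK]": 1, "A": 2, "C": 3, "G": 4, "T": 5}
IGNORE_INDEX = -100  # positions the MLM loss should skip

def mask_for_mlm(token_ids, mask_prob=0.15, rng=random):
    """BERT-style masking: select ~15% of tokens; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged."""
    inputs, labels = list(token_ids), [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok                      # model must recover the original token here
        r = rng.random()
        if r < 0.8:
            inputs[i] = VOCAB["[MASK]"]      # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.choice([VOCAB[b] for b in "ACGT"])  # 10%: random base
        # remaining 10%: keep the original token unchanged
    return inputs, labels

# Example: encode a short DNA fragment and corrupt it for pretraining.
sequence = "ACGTACGTGGCA"
ids = [VOCAB[base] for base in sequence]
corrupted, targets = mask_for_mlm(ids)
print(corrupted)
print(targets)

In actual pretraining, the corrupted inputs are fed to the encoder and the cross-entropy loss is computed only at positions whose label differs from the ignore index, so the model learns to reconstruct the masked tokens from their context.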
arXiv.org Artificial Intelligence
Sep-23-2024