dnaGrinder: a lightweight and high-capacity genomic foundation model
Qihang Zhao, Chi Zhang, Weixiong Zhang
Foundation models (also known as large language models) such as BERT [1] and GPT [2] have demonstrated stellar performance in learning the complex characteristics and structures of natural languages, making them well suited for a variety of downstream applications such as sentiment analysis, text generation, and translation [3]. These foundation models have recently been adapted to analyze biological sequences, as their deep architectures and large-scale parameters are well suited to the intricacy of biological sequences and structures [4, 5, 6, 7, 8, 9, 10, 11]. Biological sequences, composed of nucleotides in DNA and RNA or of amino acids in peptides and proteins, are regarded as natural languages of life, and foundation models can be leveraged to uncover the underlying patterns and functions they encode [12]. Typically, these foundation models build robust feature representations from biological sequences through a process known as pretraining. Encoder-based models like BERT perform such pretraining with Masked Language Modeling (MLM), in which the model predicts the original tokens at masked or corrupted positions in the input sequences. By pretraining on millions of biological sequences, foundation models gain a comprehensive contextual understanding of those sequences. Once pretrained, they need only a few fine-tuning steps to be effectively applied to specific downstream tasks [13], including prediction of epigenetic marks, gene expression, protein folding structures, and more.
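To make the MLM objective mentioned above concrete, the following is a minimal sketch of the standard BERT-style masking step applied to a toy nucleotide-level tokenization. The VOCAB mapping, the mask_for_mlm helper, and single-base tokens are illustrative assumptions for this example only, not dnaGrinder's actual tokenizer or training code; real genomic models typically use k-mer or BPE tokenizers, but the corruption logic is the same.

import random

# Toy nucleotide-level vocabulary (illustrative assumption, not dnaGrinder's tokenizer).
VOCAB = {"[PAD]": 0, "[MASK]": 1, "A": 2, "C": 3, "G": 4, "T": 5}
IGNORE_INDEX = -100  # positions the MLM loss should skip

def mask_for_mlm(token_ids, mask_prob=0.15, rng=random):
    """BERT-style masking: select ~15% of tokens; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged."""
    inputs, labels = list(token_ids), [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok                      # model must recover the original token here
        r = rng.random()
        if r < 0.8:
            inputs[i] = VOCAB["[MASK]"]      # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.choice([VOCAB[b] for b in "ACGT"])  # 10%: random base
        # remaining 10%: keep the original token unchanged
    return inputs, labels

# Example: encode a short DNA fragment and corrupt it for pretraining.
sequence = "ACGTACGTGGCA"
ids = [VOCAB[base] for base in sequence]
corrupted, targets = mask_for_mlm(ids)
print(corrupted)
print(targets)

In actual pretraining, the corrupted inputs are fed to the encoder and the cross-entropy loss is computed only at positions whose label differs from the ignore index, so the model learns to reconstruct the masked tokens from their context.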
arXiv.org Artificial Intelligence
Sep-23-2024