Incorporating LLM Embeddings for Variation Across the Human Genome
Hongqian Niu, Jordan Bryan, Xihao Li, Didong Li
–arXiv.org Artificial Intelligence
In the past few years, foundation models built on large transformer networks, such as Google's BERT (Kenton and Toutanova, 2019) and OpenAI's GPT family (Radford, 2018), have proven to be invaluable aids for scientific discovery in the analysis of genomic data (Cui et al., 2024; Theodoris et al., 2023; Chen and Zou, 2025). Foundation models targeted at genomic applications are typically trained on enormous databases of experimental data: scGPT (Cui et al., 2024), for example, was trained on transcriptomes from 33 million human cells drawn from 441 different studies, and the GeneFormer model (Theodoris et al., 2023) was trained on 29.9 million human single-cell transcriptomes. On the other hand, foundation models pre-trained on internet-scale corpora of natural-language text may offer distinct advantages, such as capturing niche biological relationships that are widely documented in the scientific literature but not necessarily represented experimentally in large-scale genomics datasets. For this reason, some recent works have used the embedding outputs of large language models (LLMs) such as ChatGPT (Radford, 2018) to encode the biological information contained in text-based gene descriptions, such as those in the NCBI database (Schoch et al., 2020). Notably, Chen and Zou (2025) show that these text-based gene descriptions can be fed to GPT-3.5 to obtain gene embeddings that serve as features/covariates for standard prediction algorithms, an approach denoted GenePT.
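To make the GenePT-style recipe concrete, the sketch below illustrates the general idea under stated assumptions: a placeholder `embed_description` function stands in for a call to an LLM embedding endpoint (the paper's pipeline uses GPT-3.5-era embeddings of NCBI gene summaries), and the resulting per-gene vectors are used as covariates in an off-the-shelf classifier. The helper name, the toy gene descriptions and labels, and the embedding dimension are all illustrative, not taken from the paper or from GenePT's code.

```python
# Illustrative GenePT-style pipeline (not the authors' implementation):
# 1) embed text-based gene descriptions with an LLM embedding model,
# 2) use the embeddings as features for a standard prediction algorithm.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def embed_description(description: str, dim: int = 1536) -> np.ndarray:
    """Hypothetical stand-in for an LLM embedding call.

    In practice this would send the NCBI gene summary text to an
    embedding endpoint; here it returns a pseudo-random vector so
    the sketch runs end to end without network access.
    """
    rng = np.random.default_rng(abs(hash(description)) % (2**32))
    return rng.normal(size=dim)


# Toy inputs: gene symbols with short text descriptions and made-up
# binary labels (e.g., membership in some gene set of interest).
gene_descriptions = {
    "TP53": "Tumor suppressor; regulates cell cycle arrest and apoptosis.",
    "BRCA1": "Involved in DNA double-strand break repair.",
    "ACTB": "Cytoskeletal beta-actin; common housekeeping gene.",
    "GAPDH": "Glycolytic enzyme; common housekeeping gene.",
}
labels = np.array([1, 1, 0, 0])

# Stack the per-gene embeddings into a feature matrix X.
X = np.vstack([embed_description(d) for d in gene_descriptions.values()])

# Any standard predictor can consume these features; logistic regression here.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, labels, cv=2)
print("Toy cross-validated accuracy:", scores.mean())
```

The same feature matrix could equally be passed to a regression model or a downstream genomic association analysis; the point of the sketch is only that the LLM embeddings enter as ordinary covariates.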
Sep-26-2025