Efficient and Scalable Fine-Tune of Language Models for Genome Understanding

Huixin Zhan, Ying Nian Wu, Zijun Zhang

arXiv.org Artificial Intelligence 

Although DNA foundation models have advanced genome understanding, they still face significant challenges stemming from the limited scale and diversity of genomic data. This limitation contrasts starkly with the success of natural language foundation models, which thrive on substantially larger data scales. Furthermore, genome understanding involves numerous downstream genome annotation tasks with inherent data heterogeneity, necessitating more efficient and robust fine-tuning methods tailored for genomics. To address these challenges, Lingo accommodates numerous, heterogeneous downstream fine-tuning tasks through an adaptive rank sampling method that prunes and stochastically reintroduces pruned singular vectors within small computational budgets. Adaptive rank sampling outperformed existing fine-tuning methods on all 14 benchmarked genome understanding tasks while requiring fewer than 2% of the trainable parameters as genomic-specific adapters. Impressively, applying these adapters to natural language foundation models matched or even exceeded the performance of DNA foundation models. Lingo presents a new paradigm of efficient and scalable genome understanding via genomic-specific adapters on language models.

DNA foundation models, such as DNABERT [1], DNABERT-2 [2], and the Nucleotide Transformer (NT) [3], have made significant progress in decoding the linguistic intricacies of the genome. An important paradigm for utilizing such DNA foundation models is "pre-training + fine-tuning", i.e., pre-training on unlabeled genomic sequences and then adapting the model to a particular genome understanding task. A critical aspect of genome annotation and its downstream tasks is their considerable number and diversity. For example, state-of-the-art deep learning models in epigenetics alone can encompass nearly 22,000 individual tasks [4].
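The adaptive rank sampling idea can be pictured with a small PyTorch-style sketch. The snippet below is a minimal illustration under stated assumptions, not the paper's implementation: it assumes an SVD-parameterized low-rank adapter (the names `SVDAdapter`, `P`, `lam`, `Q`, `adaptive_rank_sampling`, and the importance proxy are all hypothetical) and shows how singular vectors outside a small rank budget could be masked out and then stochastically reintroduced.

```python
# Minimal sketch of adaptive rank sampling over an SVD-parameterized adapter.
# All names and the importance heuristic are illustrative, not from the paper.
import torch
import torch.nn as nn


class SVDAdapter(nn.Module):
    """Adds a low-rank update P @ diag(lam) @ Q on top of a frozen linear layer."""

    def __init__(self, base: nn.Linear, max_rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():              # keep the foundation model frozen
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.P = nn.Parameter(torch.randn(d_out, max_rank) * 0.01)
        self.lam = nn.Parameter(torch.zeros(max_rank))            # singular values
        self.Q = nn.Parameter(torch.randn(max_rank, d_in) * 0.01)
        # 1 = singular vector active, 0 = pruned (a buffer, not a learnable parameter)
        self.register_buffer("mask", torch.ones(max_rank))

    def forward(self, x):
        delta = (self.P * (self.lam * self.mask)) @ self.Q        # masked low-rank update
        return self.base(x) + x @ delta.T


def adaptive_rank_sampling(adapter: SVDAdapter, budget: int, p_revive: float = 0.1):
    """Prune singular vectors outside the top-`budget` importance scores,
    then stochastically reintroduce a fraction of the pruned ones.
    `budget` must not exceed the adapter's max_rank."""
    importance = adapter.lam.detach().abs()                       # proxy importance score
    keep = torch.zeros_like(adapter.mask)
    keep[importance.topk(budget).indices] = 1.0
    pruned = keep == 0
    revive = pruned & (torch.rand_like(keep) < p_revive)          # stochastic reintroduction
    adapter.mask.copy_(keep + revive.float())
```

In use, `adaptive_rank_sampling` would be called periodically during fine-tuning so that the effective rank stays within the computational budget while pruned directions still have a chance to return to training.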
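To make the "pre-training + fine-tuning" paradigm and the sub-2% parameter budget concrete, the sketch below freezes a pre-trained language model, wraps its larger linear projections with the adapter sketched above, and reports the fraction of trainable parameters. The backbone checkpoint ("bert-base-uncased") and the helper `attach_adapters` are assumptions for illustration only; the paper's benchmarked backbones (e.g., DNABERT-2, NT) have their own loading conventions.

```python
# Sketch: freeze a pre-trained language model, attach SVD-style adapters, and
# check that the adapters stay within a small fraction of trainable parameters.
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
for p in model.parameters():
    p.requires_grad_(False)                           # freeze the pre-trained backbone
for p in model.classifier.parameters():
    p.requires_grad_(True)                            # the task head stays trainable


def attach_adapters(root: nn.Module, max_rank: int = 8):
    """Wrap the wider linear projections with SVDAdapter (illustrative heuristic)."""
    # Collect targets first so freshly inserted adapters are not re-wrapped.
    targets = [
        (parent, name, child)
        for parent in root.modules()
        for name, child in parent.named_children()
        if isinstance(child, nn.Linear) and child.out_features >= 512
    ]
    for parent, name, child in targets:
        setattr(parent, name, SVDAdapter(child, max_rank=max_rank))


attach_adapters(model)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.2%}")  # roughly 1-2% for this setup
```

The frozen backbone plus small adapters is what makes per-task fine-tuning scale to many heterogeneous genome annotation tasks: only the adapter and head weights need to be stored and trained for each task.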