FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics

ChenRui Duan, Zelin Zang, Yongjie Xu, Hang He, Zihan Liu, Zijia Song, Ju-Sheng Zheng, Stan Z. Li

arXiv.org Artificial Intelligence 

Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments such as oceans and soils, and significantly impact human health and ecological functions. However, current research relies on k-mer representations, which limit the capture of structurally relevant gene contexts. To address these limitations and further our understanding of the complex relationships between metagenomic sequences and their functions, we introduce a protein-based gene representation as a context-aware and structure-relevant tokenizer. Our approach includes Masked Gene Modeling (MGM) for gene group-level pre-training, which captures inter-gene contextual information, and Triple Enhanced Metagenomic Contrastive Learning (TEM-CL) for gene-level pre-training, which models gene sequence-function relationships. Together, MGM and TEM-CL constitute our novel metagenomic language model FGBERT, pre-trained on 100 million metagenomic sequences. We demonstrate the superiority of FGBERT on eight datasets.
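To make the two objectives concrete, below is a minimal sketch (not the authors' released code) of how they might be realized in PyTorch. All names here (the encoder argument, mgm_loss, tem_cl_loss, the regression-style MGM target, and the triplet form of the contrastive term) are illustrative assumptions; the abstract specifies only the objectives' roles, not their exact formulations.

import torch
import torch.nn.functional as F

def mgm_loss(encoder, gene_embeddings, mask_token, mask_prob=0.15):
    # Masked Gene Modeling (MGM): randomly hide genes within a gene group
    # and reconstruct their protein-based gene embeddings from context.
    # Regression to the original embeddings is an assumption for illustration.
    batch, n_genes, _ = gene_embeddings.shape
    mask = torch.rand(batch, n_genes, device=gene_embeddings.device) < mask_prob
    corrupted = gene_embeddings.clone()
    corrupted[mask] = mask_token            # learned [MASK] vector of shape (dim,)
    predicted = encoder(corrupted)          # contextual predictions, (batch, n_genes, dim)
    return F.mse_loss(predicted[mask], gene_embeddings[mask])

def tem_cl_loss(anchor, positive, negative, margin=1.0):
    # One plausible reading of TEM-CL's contrastive term: pull a gene
    # embedding toward a function-consistent positive and push it away
    # from a negative, in triplet-margin form.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

In this sketch, the group-level MGM term trains the encoder to use surrounding genes as context, while the gene-level contrastive term shapes the embedding space around function; how the two losses are weighted and combined is left unspecified here, as the abstract does not say.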