Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification

Liu, Zicheng, Li, Siyuan, Chen, Zhiyuan, Xin, Lei, Wu, Fang, Yu, Chang, Yang, Qirong, Guo, Yucheng, Yang, Yujie, Li, Stan Z.

Feb-11-2025–arXiv.org Artificial Intelligence

The interactions between DNA, RNA, and proteins are fundamental to biological processes, as illustrated by the central dogma of molecular biology. While modern biological pre-trained models have achieved great success in analyzing these macromolecules individually, their interconnected nature remains under-explored. In this paper, we follow the central dogma to redesign both the data and model pipelines and offer a comprehensive framework, Life-Code, that spans different biological functions. For the data flow, we propose a unified pipeline that integrates multi-omics data by reverse-transcribing RNA and reverse-translating amino acids into nucleotide-based sequences. For the model, we design a codon tokenizer and a hybrid long-sequence architecture that encode the interactions of both coding and non-coding regions through masked modeling pre-training. To model the translation and folding process with coding sequences, Life-Code learns the protein structures of the corresponding amino acids by knowledge distillation from off-the-shelf protein language models. These designs enable Life-Code to capture complex interactions within genetic sequences, providing a more comprehensive understanding of multi-omics under the central dogma. Extensive experiments show that Life-Code achieves state-of-the-art performance on various tasks across three omics, highlighting its potential for advancing multi-omics analysis and interpretation.
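The data unification step described in the abstract (mapping RNA and amino-acid sequences into nucleotide space, then tokenizing by codon) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, reverse transcription is simplified to a U-to-T substitution on the same strand, and reverse translation picks the first synonymous codon of the standard genetic code, since the abstract does not say how Life-Code chooses among synonymous codons or handles non-coding regions.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# ASSUMPTION: use the first codon (in TCAG order) for each amino acid.
# The real pipeline may use codon-usage statistics or another rule.
AA_TO_CODON = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODON.setdefault(aa, codon)


def reverse_transcribe(rna: str) -> str:
    """Map an RNA sequence to DNA letters (simplified: U -> T, same strand)."""
    return rna.upper().replace("U", "T")


def reverse_translate(protein: str) -> str:
    """Map an amino-acid sequence to one possible coding DNA sequence."""
    return "".join(AA_TO_CODON[aa] for aa in protein.upper())


def codon_tokenize(dna: str) -> list[str]:
    """Split a DNA sequence into non-overlapping 3-mers (a trailing remainder is kept)."""
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def translate(dna: str) -> str:
    """Translate complete codons back to amino acids (used here as a sanity check)."""
    return "".join(CODON_TO_AA[c] for c in codon_tokenize(dna) if len(c) == 3)


if __name__ == "__main__":
    dna_from_rna = reverse_transcribe("AUGAAAUUU")
    dna_from_protein = reverse_translate("MKF")
    print(codon_tokenize(dna_from_rna))
    print(dna_from_protein, translate(dna_from_protein))
```

Under these assumptions, all three modalities end up as strings over the same four-letter alphabet, so a single codon-level tokenizer can serve DNA, RNA, and protein inputs. Reverse translation is one-to-many, so the round trip protein to DNA to protein is lossless, while DNA to protein to DNA generally is not.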

bioinformatics, machine learning, natural language

Country:
- North America > United States (0.04)
- Asia > China
  - Zhejiang Province > Hangzhou (0.04)
  - Hong Kong (0.04)
  - Beijing > Beijing (0.04)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Health & Medicine
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area
    - Pulmonary/Respiratory Diseases (1.00)
    - Infections and Infectious Diseases (1.00)
    - Immunology (0.93)

Technology:
- Information Technology
  - Biomedical Informatics > Translational Bioinformatics (1.00)
  - Artificial Intelligence
    - Natural Language (1.00)
    - Machine Learning
      - Neural Networks > Deep Learning (1.00)
      - Performance Analysis > Accuracy (0.67)
