BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations

Pei, Qizhi, Zhang, Wei, Zhu, Jinhua, Wu, Kehan, Gao, Kaiyuan, Wu, Lijun, Xia, Yingce, Yan, Rui

Jan-28-2024–arXiv.org Artificial Intelligence

Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose BioT5, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. BioT5 utilizes SELFIES for 100% robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. Furthermore, BioT5 distinguishes between structured and unstructured knowledge, leading to more effective utilization of information. After fine-tuning, BioT5 shows superior performance across a wide range of tasks, demonstrating its strong capability of capturing underlying relations and properties of bio-entities. Our code is available at https://github.com/QizhiPei/BioT5 (GitHub).

Country:
- South America > Chile
  - Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America
  - Dominican Republic (0.04)
  - United States
    - Michigan > Washtenaw County
      - Ann Arbor (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - California > Santa Clara County
      - Palo Alto (0.04)
  - Canada > Ontario
    - Toronto (0.04)
- Europe
  - France (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - Middle East > UAE
    - Abu Dhabi Emirate > Abu Dhabi (0.04)
  - China > Beijing
    - Beijing (0.04)

Genre:
- Research Report (1.00)

Industry:
- Health & Medicine
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area
    - Infections and Infectious Diseases (1.00)
    - Immunology (0.69)
- Government > Regional Government
  - North America Government > United States Government > FDA (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Text Processing (1.00)
    - Large Language Model (1.00)
    - Chatbot (0.93)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)