L+M-24: Building a Dataset for Language + Molecules @ ACL 2024

Edwards, Carl, Wang, Qingyun, Zhao, Lawrence, Ji, Heng

Jul-4-2024–arXiv.org Artificial Intelligence

Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. At this point, datasets have been released which are 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, and 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the $\textit{L+M-24}$ dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, $\textit{L+M-24}$ is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.

inhibitor, molecule, preprint arxiv, (15 more...)

arXiv.org Artificial Intelligence

Jul-4-2024

arXiv.org PDF

Add feedback

Country:
- North America
  - Dominican Republic (0.04)
  - United States
    - Washington > King County
      - Seattle (0.04)
    - Illinois > Champaign County
      - Urbana (0.04)
- Asia > Middle East
  - UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)

Genre:
- Research Report (0.40)

Industry:
- Food & Agriculture > Agriculture (0.94)
- Energy (0.69)
- Materials > Chemicals (0.68)
- Health & Medicine
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area
    - Neurology (0.94)
    - Oncology (0.68)
- Government > Regional Government
  - North America Government > United States Government (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.70)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found