ChemX: A Collection of Chemistry Datasets for Benchmarking Automated Information Extraction

Jun-20-2026, 17:38:28 GMT–Neural Information Processing Systems

Despite recent advances in machine learning, many scientific discoveries in chemistry still rely on manually curated datasets extracted from the scientific literature. Automating information extraction in specialized chemistry domains has the potential to scale up machine learning applications and improve the quality of predictions, enabling data-driven scientific discoveries at a faster pace. In this paper, we present ChemX, a collection of 10 benchmarking datasets across several domains of chemistry that provides a reliable basis for evaluating and fine-tuning automated information extraction methods. The datasets, which encompass various properties of small molecules and nanomaterials, have been manually extracted from peer-reviewed publications and systematically validated by domain experts through a cross-verification procedure that allows errors to be identified and corrected at the source. To demonstrate the utility of the resulting datasets, we evaluate the extraction performance of state-of-the-art large language models (LLMs). Moreover, we design our own agentic approach to take full control of document preprocessing before LLM-based information extraction.
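As a rough illustration of what evaluating extraction performance against a curated dataset can look like, the sketch below scores a list of extracted records against a list of reference records using exact-match precision, recall, and F1. The abstract does not state which metrics or matching rules the authors use, so the scoring scheme, the `score_extraction` and `normalize` helpers, and the record fields (`compound`, `property`, `value`) are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# A record is a flat mapping of field name to value.
# The field names used below are hypothetical placeholders.
Record = Dict[str, str]


def normalize(record: Record) -> Tuple[Tuple[str, str], ...]:
    """Turn a record into a hashable, case- and whitespace-insensitive key."""
    return tuple(
        sorted((k.strip().lower(), str(v).strip().lower()) for k, v in record.items())
    )


def score_extraction(predicted: List[Record], reference: List[Record]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over whole records (assumed metric)."""
    pred = {normalize(r) for r in predicted}
    ref = {normalize(r) for r in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Placeholder records, not taken from any real dataset.
reference = [
    {"compound": "compound_a", "property": "property_x", "value": "1.0"},
    {"compound": "compound_b", "property": "property_x", "value": "2.0"},
]
predicted = [
    {"compound": "Compound_A", "property": "property_x", "value": "1.0 "},
    {"compound": "compound_b", "property": "property_x", "value": "3.0"},
]

scores = score_extraction(predicted, reference)
print(scores)
```

Matching whole records exactly is a strict choice: a single wrong field makes the record count as both a false positive and a false negative. A field-level or tolerance-based comparison would be more lenient, and which one is appropriate depends on the dataset.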

data mining, large language model, machine learning

Country:
- Europe > Russia (0.28)

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Energy (0.67)
- Materials > Chemicals
  - Commodity Chemicals > Petrochemicals (0.68)
- Health & Medicine
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area > Infections and Infectious Diseases (0.68)

Technology:
- Information Technology
  - Data Science > Data Mining
    - Text Mining (1.00)
  - Artificial Intelligence
    - Natural Language
      - Large Language Model (1.00)
      - Information Extraction (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (0.46)
