ATG: Benchmarking Automated Theorem Generation for Generative Language Models

Lin, Xiaohan, Cao, Qingxing, Huang, Yinya, Yang, Zhicheng, Liu, Zhengying, Li, Zhenguo, Liang, Xiaodan

May-4-2024–arXiv.org Artificial Intelligence

Humans can develop new theorems to explore broader and more complex mathematical results. While current generative language models (LMs) have achieved significant improvement in automatically proving theorems, their ability to generate new or reusable theorems is still under-explored. Without the new theorems, current LMs struggle to prove harder theorems that are distant from the given hypotheses with the exponentially growing search space. Therefore, this paper proposes an Automated Theorem Generation (ATG) benchmark that evaluates whether an agent can automatically generate valuable (and possibly brand new) theorems that are applicable for downstream theorem proving as reusable knowledge. Specifically, we construct the ATG benchmark by splitting the Metamath library into three sets: axioms, library, and problem based on their proving depth. We conduct extensive experiments to investigate whether current LMs can generate theorems in the library and benefit the problem theorems proving. The results demonstrate that high-quality ATG data facilitates models' performances on downstream

Figure 1: An example theorem generated by GPT-4 (OpenAI, 2023). GPT-4 wrongly refers to the intermediate theorem (A (B A) A A) as ((A B) (A A)). In Step 4, it applies "ax-1" but obtains the wrong expression instead of correct (A (B A)) and can not derive (A A) even with the incorrect Steps 4 and 5.
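The three-way split by proving depth mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the definition of depth (0 for a statement that references nothing, otherwise one more than the deepest statement it references), the threshold `library_max_depth`, the function names, and the toy dependency table `TOY_DEPS` (only the axiom labels `ax-1`, `ax-2`, and `ax-mp` are real Metamath names; `t1` to `t3` are made up).

```python
from functools import lru_cache


def proving_depth(deps):
    """Return an assumed 'proving depth' for every statement in deps.

    deps maps a statement name to the names its proof references.
    A statement with no references (an axiom) has depth 0; any other
    statement has depth 1 + the maximum depth of what it references.
    """
    @lru_cache(maxsize=None)
    def depth(name):
        used = deps.get(name, ())
        if not used:
            return 0
        return 1 + max(depth(u) for u in used)

    return {name: depth(name) for name in deps}


def split_by_depth(deps, library_max_depth):
    """Split statements into axioms / library / problem sets.

    library_max_depth is a hypothetical cut-off: statements with
    0 < depth <= library_max_depth go to the library, deeper ones
    go to the problem set.
    """
    split = {"axioms": [], "library": [], "problem": []}
    for name, d in proving_depth(deps).items():
        if d == 0:
            split["axioms"].append(name)
        elif d <= library_max_depth:
            split["library"].append(name)
        else:
            split["problem"].append(name)
    return split


# Toy dependency table (illustrative only, not real Metamath data).
TOY_DEPS = {
    "ax-1": (),
    "ax-2": (),
    "ax-mp": (),
    "t1": ("ax-1", "ax-mp"),
    "t2": ("ax-2", "t1"),
    "t3": ("t2", "ax-mp"),
}

if __name__ == "__main__":
    print(split_by_depth(TOY_DEPS, library_max_depth=2))
```

With the toy table and a cut-off of 2, `t1` and `t2` land in the library set and `t3` in the problem set. How the benchmark actually measures depth and where it places the boundaries is not stated in this excerpt.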


