TheoremQA: A Theorem-driven Question Answering dataset

Chen, Wenhu, Yin, Ming, Ku, Max, Lu, Pan, Wan, Yixin, Ma, Xueguang, Xu, Jianyu, Wang, Xinyi, Xia, Tony

Dec-5-2023–arXiv.org Artificial Intelligence

Recent LLMs such as GPT-4 and PaLM-2 have made tremendous progress on fundamental math problems like GSM8K, achieving over 90% accuracy. However, their ability to solve more challenging math problems that require domain-specific knowledge (i.e., theorems) has yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' ability to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts and contains 800 high-quality questions covering 350 theorems (e.g., Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem, etc.) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies such as Chain-of-Thoughts and Program-of-Thoughts. We find that GPT-4's ability to solve these problems is unparalleled, achieving an accuracy of 51% with Program-of-Thoughts prompting. All existing open-source models score below 15%, barely surpassing the random-guess baseline. Given the diversity and broad coverage of TheoremQA, we believe it can serve as a better benchmark for evaluating LLMs' ability to solve challenging science problems. The data and code are released at https://github.com/wenhuchen/TheoremQA.
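To make the Program-of-Thoughts setting concrete, the following is a minimal sketch of how such an evaluation could be scored: the model's output is treated as a Python program, the program is executed, and the value it leaves in a variable is compared with the reference answer. The helper names (`run_program`, `is_correct`, `accuracy`), the `ans` variable convention, the 1% relative tolerance, and the two toy programs are all assumptions made for illustration; they are not taken from the TheoremQA codebase.

```python
import math


def run_program(source: str):
    """Execute a model-generated program and return the value it stores in `ans`.

    Hypothetical convention: the program is expected to assign its final
    result to a variable named `ans`. Any runtime error yields None.
    Note: exec() on untrusted text is unsafe; a real harness would sandbox it.
    """
    namespace = {"math": math}
    try:
        exec(source, namespace)
    except Exception:
        return None
    return namespace.get("ans")


def is_correct(predicted, expected, rel_tol=1e-2):
    """Compare a prediction with the reference answer.

    Numeric answers are matched within a relative tolerance (an assumed 1%);
    everything else must match exactly.
    """
    if predicted is None:
        return False
    if isinstance(predicted, bool) or isinstance(expected, bool):
        return predicted == expected
    if isinstance(predicted, (int, float)) and isinstance(expected, (int, float)):
        return math.isclose(predicted, expected, rel_tol=rel_tol)
    return predicted == expected


def accuracy(programs, answers):
    """Fraction of programs whose executed result matches the reference."""
    correct = sum(is_correct(run_program(p), a) for p, a in zip(programs, answers))
    return correct / len(answers)


# Toy stand-ins for model outputs (not real TheoremQA items).
programs = [
    # Truncated Taylor series for e^0.5.
    "ans = sum(0.5**k / math.factorial(k) for k in range(10))",
    # A program that crashes is counted as wrong.
    "ans = 1 / 0",
]
answers = [math.exp(0.5), 2.0]

print(accuracy(programs, answers))
```

Under these assumptions, a program that raises an error or produces a value outside the tolerance counts as incorrect, so accuracy is the share of questions for which the executed program reproduces the reference answer.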

dataset, language model, theorem

Country:
- North America
  - United States
    - Illinois > Champaign County
      - Champaign (0.04)
    - California
      - Los Angeles County > Los Angeles (0.28)
      - Santa Barbara County > Santa Barbara (0.14)
  - Canada > Ontario
    - Waterloo Region > Waterloo (0.04)
- Europe
  - Spain > Valencian Community
    - Valencia Province > Valencia (0.04)
  - Portugal > Lisbon
    - Lisbon (0.04)
- Asia > Middle East
  - Jordan (0.04)
  - UAE (0.04)

Genre:
- Research Report (0.64)

Industry:
- Education > Educational Setting (0.47)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)