Evaluating Language Model Math Reasoning via Grounding in Educational Curricula

Li Lucy, Tal August, Rose E. Wang, Luca Soldaini, Courtney Allison, Kyle Lo

Aug-9-2024–arXiv.org Artificial Intelligence

Our work presents a novel angle for evaluating language models' (LMs) mathematical abilities by investigating whether they can discern the skills and concepts enabled by math content. We contribute two datasets: one consisting of 385 fine-grained descriptions of K-12 math skills and concepts, or standards, from Achieve the Core (ATC), and another of 9.9K problems labeled with these standards (MathFish). Working with experienced teachers, we find that LMs struggle to tag and verify the standards linked to problems; instead, they predict labels that are close to the ground truth but differ in subtle ways. We also show that LMs often generate problems that do not fully align with the standards described in prompts. Finally, we categorize problems in GSM8k using math standards, allowing us to better understand why some problems are more difficult for models to solve than others.

Country:
- North America
  - Dominican Republic (0.04)
  - United States
    - Maryland (0.04)
    - Arizona (0.04)
    - New York > New York County
      - New York City (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - Illinois > Cook County
      - Chicago (0.04)
    - California > Alameda County
      - Berkeley (0.04)
  - Canada > Ontario
    - Toronto (0.04)
- Europe
  - Netherlands (0.04)
  - Monaco (0.04)
  - Greece > Crete
    - Chania (0.04)
- Asia
  - Singapore (0.04)
  - Middle East > Jordan (0.04)

Genre:
- Instructional Material (1.00)
- Research Report > Experimental Study (0.46)

Industry:
- Education
  - Curriculum > Subject-Specific Education (0.46)
  - Educational Setting > K-12 Education (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.72)
  - Machine Learning > Neural Networks
    - Deep Learning (0.72)
