Testing GPT-4 with Wolfram Alpha and Code Interpreter plug-ins on math and science problems

Aug-14-2023–arXiv.org Artificial Intelligence

Our test sets were too small and too haphazard to support statistically valid conclusions, but they do suggest a number of tentative ones. We summarize these here, and discuss them at greater length in section 7. Over the kinds of problems tested, GPT-4 with either plug-in is significantly stronger than GPT-4 by itself, or, almost certainly, than any AI that existed a year ago. However, it is still far from reliable; it often outputs a wrong answer or fails to output any answer. In terms of overall score, we would judge that these systems perform on the level of a middling undergraduate student. However, their capacities and weaknesses do not align with those of a human student; the systems solve some problems that even capable students would find challenging, whereas they fail on some problems that even middling high school students would find easy.

calculation, machine learning, natural language

Country:
- North America
  - Canada (0.67)
  - United States
    - California (0.67)
    - Illinois (0.46)
    - Texas (0.68)

Genre:
- Research Report (0.41)

Industry:
- Education > Educational Setting > K-12 Education > Secondary School (0.54)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language
    - Chatbot (1.00)
    - Large Language Model (1.00)
