Testing GPT-4 with Wolfram Alpha and Code Interpreter plug-ins on math and science problems
Davis, Ernest; Aaronson, Scott
– arXiv.org Artificial Intelligence
Our test sets were too small and too haphazard to support statistically valid conclusions, but they were suggestive of a number of conclusions. We summarize these here and discuss them at greater length in section 7. Over the kinds of problems tested, GPT-4 with either plug-in is significantly stronger than GPT-4 by itself, or, almost certainly, than any AI that existed a year ago. However, it is still far from reliable; it often outputs a wrong answer or fails to output any answer. In terms of overall score, we would judge that these systems perform at the level of a middling undergraduate student. However, their capacities and weaknesses do not align with those of a human student; the systems solve some problems that even capable students would find challenging, whereas they fail on some problems that even middling high school students would find easy.
Aug-14-2023
- Country:
  - North America
    - Canada (0.67)
    - United States
      - California (0.67)
      - Illinois (0.46)
      - Texas (0.68)
- Genre:
  - Research Report (0.41)
- Industry:
  - Education > Educational Setting > K-12 Education > Secondary School (0.54)
- Technology: