Large Language Models as Evaluators for Scientific Synthesis
Evans, Julia, D'Souza, Jennifer, Auer, Sören
arXiv.org Artificial Intelligence
Our study explores how well state-of-the-art Large Language Models (LLMs), such as GPT-4 and Mistral, can assess the quality of scientific summaries or, more fittingly, scientific syntheses, by comparing their evaluations to those of human annotators. We used a dataset of 100 research questions, each paired with a synthesis generated by GPT-4 from the abstracts of five related papers, and checked the LLM evaluations against human quality ratings. The study evaluates the ability of both the closed-source GPT-4 and the open-source Mistral model to rate these syntheses and to give reasons for their judgments. Preliminary results show that LLMs can offer logical explanations that somewhat match the quality ratings, yet a deeper statistical analysis shows only a weak correlation between LLM and human ratings, which points to both the potential and the current limitations of LLMs in scientific synthesis evaluation.
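As a rough illustration of what comparing LLM and human ratings can look like, the sketch below computes a rank correlation between two lists of scores. The abstract does not name the statistic used, so Spearman's rho is an assumption here, and the ratings, the 1-5 scale, and the function names are hypothetical placeholders rather than data or code from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings for eight syntheses (illustrative only, not study data).
human_ratings = [5, 3, 4, 2, 4, 1, 3, 5]
llm_ratings = [4, 4, 5, 4, 3, 3, 4, 5]

rho = spearman(human_ratings, llm_ratings)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 would mean the model orders the syntheses much as the annotators do, while a value near 0 would correspond to the kind of weak agreement the abstract reports.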
Jul-3-2024
- Country:
  - North America
    - United States
      - Pennsylvania (0.04)
      - Michigan (0.04)
      - New York > New York County
        - New York City (0.04)
    - Canada > Ontario
      - Toronto (0.04)
  - Europe
    - Switzerland (0.04)
    - Slovenia (0.04)
    - Sweden > Uppsala County
      - Uppsala (0.04)
    - Spain > Catalonia
      - Barcelona Province > Barcelona (0.04)
    - Portugal > Lisbon
      - Lisbon (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Germany > Lower Saxony
      - Hanover (0.04)
  - Asia
- Genre:
  - Research Report > New Finding (1.00)
- Industry:
  - Education (1.00)
  - Materials > Chemicals (0.70)
  - Energy (0.70)
  - Government (0.68)
  - Health & Medicine > Therapeutic Area
    - Psychiatry/Psychology (0.47)
- Technology: