Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models
Säuberli, Andreas, Clematide, Simon
arXiv.org Artificial Intelligence
Reading comprehension tests are used in a variety of applications, ranging from education to assessing the comprehensibility of simplified texts. However, creating such tests manually and ensuring their quality is difficult and time-consuming. In this paper, we explore how large language models (LLMs) can be used to generate and evaluate multiple-choice reading comprehension items. To this end, we compiled a dataset of German reading comprehension items and developed a new protocol for human and automatic evaluation, including a metric we call text informativity, which is based on guessability and answerability. We then used this protocol and the dataset to evaluate the quality of items generated by Llama 2 and GPT-4. Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2. We also show that LLMs can be used for automatic evaluation by eliciting item responses from them. In this scenario, evaluation results with GPT-4 were the most similar to those of human annotators. Overall, zero-shot generation with LLMs is a promising approach for generating and evaluating reading comprehension test items, in particular for languages without large amounts of available data.
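The abstract describes text informativity as a metric based on guessability (how well respondents answer items without the text) and answerability (how well they answer with the text). As a rough illustration only — the paper's exact definition is not given in the abstract, and the simple difference-based formula below is an assumption for illustration — such a metric could be sketched as:

```python
def accuracy(responses, answer_key):
    """Fraction of item responses that match the answer key."""
    assert len(responses) == len(answer_key)
    correct = sum(r == a for r, a in zip(responses, answer_key))
    return correct / len(answer_key)

def text_informativity(responses_with_text, responses_without_text, answer_key):
    """Hypothetical informativity score: answerability (accuracy when the
    respondent has read the text) minus guessability (accuracy when
    answering without the text). Higher values mean the items depend
    more on having actually understood the text."""
    answerability = accuracy(responses_with_text, answer_key)
    guessability = accuracy(responses_without_text, answer_key)
    return answerability - guessability

# Example: 4 multiple-choice items, answered by the same respondent
# with and without access to the source text.
key = ["A", "B", "C", "D"]
with_text = ["A", "B", "C", "A"]      # 3/4 correct -> answerability 0.75
without_text = ["B", "D", "D", "A"]   # 0/4 correct -> guessability 0.00
score = text_informativity(with_text, without_text, key)
```

In the automatic-evaluation setting the abstract describes, the two response lists would be elicited from an LLM prompted with and without the source passage.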
May-20-2024
- Country:
- Asia
- Indonesia > Bali (0.04)
- Malaysia (0.04)
- Middle East
- Qatar (0.04)
- Saudi Arabia (0.04)
- UAE (0.04)
- Yemen (0.05)
- Europe
- Estonia > Tartu County
- Tartu (0.04)
- Faroe Islands > Streymoy
- Tórshavn (0.04)
- Germany (0.04)
- Switzerland > Zürich
- Zürich (0.04)
- North America
- Canada > Ontario
- Toronto (0.04)
- United States > New York (0.04)
- Genre:
- Research Report > New Finding (0.86)
- Industry:
- Education > Assessment & Standards > Student Performance (1.00)