ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios

Wei, Shou'ang, Wang, Xinyun, Bi, Shuzhen, Chen, Jian, Li, Ruijia, Jiang, Bo, Lin, Xin, Zhang, Min, Song, Yu, Li, BingDong, Zhou, Aimin, Hao, Hao

Aug-1-2025–arXiv.org Artificial Intelligence

The emergence of Large Language Models (LLMs) presents transformative opportunities for education and has generated numerous novel application scenarios. However, significant challenges remain: evaluation metrics vary substantially across different educational scenarios, and many emerging scenarios lack appropriate assessment metrics altogether. To address this gap, we introduce ELMES, an open-source automated evaluation framework designed specifically for assessing LLMs in educational settings. ELMES features a modular architecture that lets researchers create dynamic, multi-agent dialogues through simple configuration files, enabling flexible scenario design without extensive programming expertise. The framework incorporates a hybrid evaluation engine that uses an LLM-as-a-Judge methodology to objectively quantify traditionally subjective pedagogical metrics. We systematically benchmark state-of-the-art LLMs across four critical educational scenarios (Knowledge Point Explanation, Guided Problem-Solving Teaching, Interdisciplinary Lesson Plan Generation, and Contextualized Question Generation) using fine-grained metrics developed in collaboration with education specialists. Our results reveal distinct capability distributions among the models, with context-specific strengths and limitations. ELMES gives educators and researchers an accessible evaluation framework that substantially lowers the barrier to adapting evaluation to diverse educational applications, while advancing the practical use of LLMs in pedagogy.

Introduction

The advent of Large Language Models (LLMs) is reshaping the educational paradigm with unprecedented potential [1]. Their powerful capabilities in natural language understanding and generation have opened new avenues for intelligent teaching and learning. Consequently, researchers are actively exploring ways to leverage LLMs to empower education.
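To make the abstract's description of configuration-driven, multi-agent dialogues scored with LLM-as-a-Judge more concrete, here is a minimal Python sketch. It is not ELMES's code or configuration schema, neither of which the abstract specifies: the `CONFIG` dict stands in for a scenario configuration file, and `Agent`, `run_dialogue`, `judge`, and `stub_judge` are hypothetical names. The agents and the judge are deterministic stubs so the sketch runs without any model API; in practice each would wrap an LLM call with a role or rubric prompt.

```python
from dataclasses import dataclass
from typing import Callable

Turn = tuple[str, str]  # (speaker name, utterance)


@dataclass
class Agent:
    """Hypothetical dialogue participant (not the ELMES API)."""
    name: str
    # A real agent would prompt an LLM with its role and the history;
    # here it is any function from the history to the next utterance.
    respond: Callable[[list[Turn]], str]


def run_dialogue(agents: list[Agent], opening: str, max_turns: int) -> list[Turn]:
    """Round-robin multi-agent dialogue seeded with a scenario prompt."""
    history: list[Turn] = [("scenario", opening)]
    for turn in range(max_turns):
        speaker = agents[turn % len(agents)]
        history.append((speaker.name, speaker.respond(history)))
    return history


def judge(transcript: list[Turn], metrics: list[str],
          score_fn: Callable[[str, list[Turn]], float],
          lo: float = 1.0, hi: float = 5.0) -> dict[str, float]:
    """LLM-as-a-Judge style scoring: one score per metric, clamped to [lo, hi]."""
    return {m: min(hi, max(lo, float(score_fn(m, transcript)))) for m in metrics}


# Hypothetical scenario configuration standing in for a config file.
CONFIG = {
    "opening": "Guide the student to show that a triangle's angles sum to 180 degrees.",
    "max_turns": 4,
    "metrics": ["accuracy", "guidance", "clarity"],
}

# Deterministic stub agents so the sketch runs without a model API.
teacher = Agent("teacher", lambda h: f"Hint {len(h)}: what do parallel lines tell you?")
student = Agent("student", lambda h: f"Answer {len(h)}: the angles seem to line up.")


def stub_judge(metric: str, transcript: list[Turn]) -> float:
    # A real judge would send the rubric for `metric` plus the transcript
    # to an LLM and parse a numeric score from its reply.
    teacher_turns = sum(1 for speaker, _ in transcript if speaker == "teacher")
    raw = {"accuracy": 4.0, "guidance": 1.0 + teacher_turns, "clarity": 7.0}
    return raw[metric]  # "clarity" is deliberately out of range to show clamping


transcript = run_dialogue([teacher, student], CONFIG["opening"], CONFIG["max_turns"])
scores = judge(transcript, CONFIG["metrics"], stub_judge)
print(scores)  # {'accuracy': 4.0, 'guidance': 3.0, 'clarity': 5.0}
```

Clamping the judge's raw output to a fixed scale is one simple guard against out-of-range scores from a judge model; whether ELMES's hybrid engine does this is not stated in the abstract.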

Keywords: large language model, machine learning, natural language
