Can Smaller Large Language Models Evaluate Research Quality?

Aug-12-2025–arXiv.org Artificial Intelligence

Research evaluation is a common and important task for academics and managers, and it is often supported by citation - based indicators (Hicks et al., 2015; Moed, 2005; Mukherjee, 2022). With the increasingly widespread use of Artificial Intelligence (AI) in research ( Mohammadi et al., 2025), it is important to check whether it can save expert time through support of the research evaluation task. ChatGPT research quality score estimates for journal articles are recent alternative s to citations as quantitative indicator s to support evaluations ( Kousha & Thelwall, 2025) . Their value lies in their positive correlation with expert judgement in all or nearly all fields, and at a slightly higher rate than for citation - based indicators ( Thelwall, 2025abc). Despite some systematic biases or disparities ( Thelwall & Kurt, 2025), t his property means that they are helpful when expert judgement fails, such as fo r areas outside of the assessor's expertise, as a cross - check for bias, and for evaluations where assessment expertise is unavailable or too expensive for the value of the task (Thelwall, 2025d) . Whilst a positive correlation with expert judgement has been established for three of the largest Large Language Models (LLMs) in 2025, ChatGPT 4o, ChatGPT 4o - mini, and Google Gemini Flash 1.5 ( Thelwall, 2025ac), these are all cloud - based services and may be too expensive or not private enough for some research evaluation purposes ( Nowak et al., 2025) . Moreover, cloud - based services can be withdrawn, updated, or made more costly, so research evaluation procedures may not be able to rely on them. Thus, there is a need to test whether any smaller "open weights" LLMs ( Sowe et al., 2024) that can be downloaded and used offline have a capability to estimate research quality.

correlation, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

Aug-12-2025

arXiv.org PDF

Add feedback

Country:
- Europe > Netherlands (0.28)

Genre:
- Research Report > New Finding (0.68)

Industry:
- Health & Medicine
  - Pharmaceuticals & Biotechnology (0.69)
  - Therapeutic Area (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found