Reducing research bureaucracy in UK higher education: Can generative AI assist with the internal evaluation of quality?
Fletcher, Gordon, Khan, Saomai Vu, Fletcher, Aldus Greenhill
This paper examines the potential for generative artificial intelligence (GenAI) to assist with internal review processes for research quality evaluations in UK higher education, particularly in preparation for the Research Excellence Framework (REF). Using the lens of function substitution in the Viable Systems Model, we present an experimental methodology using ChatGPT to score and rank business and management papers from REF 2021 submissions, "reverse engineering" the assessment by comparing AI-generated scores with known institutional results. Through rigorous testing of 822 papers across 11 institutions, we established scoring boundaries that aligned with reported REF outcomes: 49% between 1* and 2*, 59% between 2* and 3*, and 69% between 3* and 4*. The results demonstrate that AI can provide consistent evaluations that help identify borderline cases requiring additional human scrutiny while reducing the substantial resource burden of traditional internal review processes. We argue for a nuanced hybrid approach that maintains academic integrity while addressing the multi-million-pound costs of research evaluation bureaucracy. While acknowledging limitations, including potential AI biases, the research presents a promising framework for more efficient, consistent evaluations that could transform current approaches to research assessment.
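The boundary-based scoring the abstract describes can be sketched as follows. This is an illustrative reconstruction, not the authors' code: only the three boundary values (49, 59, 69) come from the abstract, while the 0-100 score scale, the function names, and the 3-point borderline margin are assumptions.

```python
# Hypothetical sketch of boundary-based star rating with borderline flagging.
# Boundary values (49, 59, 69) are from the abstract; everything else is assumed.

BOUNDARIES = [(69, 4), (59, 3), (49, 2)]  # (threshold, star rating), highest first

def star_rating(ai_score: float) -> int:
    """Map an AI score (assumed 0-100) onto a REF star rating."""
    for threshold, stars in BOUNDARIES:
        if ai_score >= threshold:
            return stars
    return 1

def is_borderline(ai_score: float, margin: float = 3.0) -> bool:
    """Flag scores within `margin` points of any boundary for human review."""
    return any(abs(ai_score - threshold) <= margin for threshold, _ in BOUNDARIES)

# e.g. star_rating(72) -> 4; is_borderline(60) -> True (close to the 59 boundary)
```

A hybrid workflow along these lines would route only the borderline papers to human reviewers, which is where the abstract locates the resource saving.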
Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?
Thelwall, Mike, Mohammadi, Ehsan
Assessing published academic journal articles is a common task for evaluations of departments and individuals. Whilst it is sometimes supported by citation data, Large Language Models (LLMs) may give more useful indications of article quality. Evidence of this capability exists for two of the largest LLM families, ChatGPT and Gemini, and the medium-sized LLM Gemma3 27b, but it is unclear whether smaller LLMs and reasoning models have similar abilities. This is important because larger models may be slow and impractical in some situations, and reasoning models may perform differently. Four relevant questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1, on a dataset of 2,780 medical, health and life science papers in 6 fields, with two different gold standards, one novel. The results suggest that smaller (open weights) and reasoning LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Moreover, averaging scores from multiple identical queries seems to be a universally successful strategy, and few-shot prompts (four examples) tended to help but the evidence was equivocal. Reasoning models did not have a clear advantage. Overall, the results show, for the first time, that smaller LLMs >4b, including reasoning models, have a substantial capability to score journal articles for research quality, especially if score averaging is used.
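The score-averaging strategy the abstract reports as universally successful can be sketched minimally: submit the same prompt several times and average the parsed scores. Here `query_llm_score` is a hypothetical stand-in (a real implementation would call an LLM and parse a quality score from its reply); the 1-5 scale, seeding, and query count are assumptions for illustration.

```python
from statistics import mean
import random

def query_llm_score(title_abstract: str, rng: random.Random) -> int:
    # Hypothetical stand-in for an LLM call: a real implementation would
    # send the title and abstract with scoring guidelines and parse the
    # model's quality score from the response. Here we simulate a noisy
    # 2-4 score so the sketch is runnable.
    return rng.randint(2, 4)

def averaged_score(title_abstract: str, n: int = 10, seed: int = 0) -> float:
    """Average the scores from n identical queries to reduce run-to-run noise."""
    rng = random.Random(seed)
    return mean(query_llm_score(title_abstract, rng) for _ in range(n))
```

The averaging step matters because individual LLM scores vary between identical queries; the mean over repeated queries is a more stable estimate.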
Can Smaller Large Language Models Evaluate Research Quality?
Research evaluation is a common and important task for academics and managers, and it is often supported by citation-based indicators (Hicks et al., 2015; Moed, 2005; Mukherjee, 2022). With the increasingly widespread use of Artificial Intelligence (AI) in research (Mohammadi et al., 2025), it is important to check whether it can save expert time through support of the research evaluation task. ChatGPT research quality score estimates for journal articles are recent alternatives to citations as quantitative indicators to support evaluations (Kousha & Thelwall, 2025). Their value lies in their positive correlation with expert judgement in all or nearly all fields, and at a slightly higher rate than for citation-based indicators (Thelwall, 2025abc). Despite some systematic biases or disparities (Thelwall & Kurt, 2025), this property means that they are helpful when expert judgement fails, such as for areas outside of the assessor's expertise, as a cross-check for bias, and for evaluations where assessment expertise is unavailable or too expensive for the value of the task (Thelwall, 2025d). Whilst a positive correlation with expert judgement has been established for three of the largest Large Language Models (LLMs) in 2025, ChatGPT 4o, ChatGPT 4o-mini, and Google Gemini Flash 1.5 (Thelwall, 2025ac), these are all cloud-based services and may be too expensive or not private enough for some research evaluation purposes (Nowak et al., 2025). Moreover, cloud-based services can be withdrawn, updated, or made more costly, so research evaluation procedures may not be able to rely on them. Thus, there is a need to test whether any smaller "open weights" LLMs (Sowe et al., 2024) that can be downloaded and used offline have a capability to estimate research quality.
Generative AI and the future of scientometrics: current topics and future questions
Lepori, Benedetto, Andersen, Jens Peter, Donnay, Karsten
The aim of this paper is to review the use of GenAI in scientometrics, and to begin a debate on the broader implications for the field. First, we provide an introduction to GenAI's generative and probabilistic nature as rooted in distributional linguistics, and relate this to the debate on the extent to which GenAI might be able to mimic human 'reasoning'. Second, we leverage this distinction for a critical engagement with recent experiments using GenAI in scientometrics, including topic labelling, the analysis of citation contexts, predictive applications, scholars' profiling, and research assessment. GenAI shows promise in tasks where language generation dominates, such as labelling, but faces limitations in tasks that require stable semantics, pragmatic reasoning, or structured domain knowledge. However, these results might become quickly outdated. Our recommendation is, therefore, to always strive to systematically compare the performance of different GenAI models for specific tasks. Third, we inquire whether, by generating large amounts of scientific language, GenAI might have a fundamental impact on our field by affecting textual characteristics used to measure science, such as authors, words, and references. We argue that careful empirical work and theoretical reflection will be essential to remain capable of interpreting the evolving patterns of knowledge production.
Evaluating the Predictive Capacity of ChatGPT for Academic Peer Review Outcomes Across Multiple Platforms
Thelwall, Mike, Yaghi, Abdullah
While previous studies have demonstrated that Large Language Models (LLMs) can predict peer review outcomes to some extent, this paper builds on that by introducing two new contexts and employing a more robust method - averaging multiple ChatGPT scores. The findings show that averaging 30 ChatGPT predictions, based on reviewer guidelines and using only the submitted titles and abstracts, failed to predict peer review outcomes for F1000Research (Spearman's rho=0.00). However, it produced mostly weak positive correlations with the quality dimensions of SciPost Physics (rho=0.25 for validity, rho=0.25 for originality, rho=0.20 for significance, and rho=0.08 for clarity) and a moderate positive correlation for papers from the International Conference on Learning Representations (ICLR) (rho=0.38). Including the full text of articles significantly increased the correlation for ICLR (rho=0.46) and slightly improved it for F1000Research (rho=0.09), while it had variable effects on the four quality dimension correlations for SciPost LaTeX files. The use of chain-of-thought system prompts slightly increased the correlation for F1000Research (rho=0.10), marginally reduced it for ICLR (rho=0.37), and further decreased it for SciPost Physics (rho=0.16 for validity, rho=0.18 for originality, rho=0.18 for significance, and rho=0.05 for clarity). Overall, the results suggest that in some contexts, ChatGPT can produce weak pre-publication quality assessments. However, the effectiveness of these assessments and the optimal strategies for employing them vary considerably across different platforms, journals, and conferences. Additionally, the most suitable inputs for ChatGPT appear to differ depending on the platform.
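All of the rho values above are Spearman rank correlations between averaged ChatGPT scores and review outcomes. For reference, Spearman's rho is the Pearson correlation of the two variables' ranks; a self-contained sketch (pure Python, not the authors' code, with standard average-rank handling of ties) follows.

```python
from statistics import mean

def _ranks(xs):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend the run of tied values
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

Because it uses ranks rather than raw values, rho is insensitive to the differing score scales of the model and the reviewers, which is why it is the usual choice for this comparison.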
Evaluating the quality of published medical research with ChatGPT
Thelwall, Mike, Jiang, Xiaorui, Bath, Peter A.
Research quality evaluation is important for departmental evaluations and academic career decisions. Unfortunately, the evaluators may not have time to fully read the work assessed and may instead rely on the reputation or Journal Impact Factor of the publishing journals, on the citation counts for individual articles, or on the reputation or career citations of the author. Whilst journal-based evidence is not optimal (Waltman & Traag, 2021), the main article-level indicator, citation counts, only directly reflects the scholarly impact of work and not its rigour, originality, and societal impacts (Aksnes, et al., 2019), all of which are relevant quality dimensions (Langfeldt et al., 2020). Moreover, article citation counts are ineffective for newer articles (Wang, 2013). In response, attempts to use Large Language Models (LLMs) to evaluate the quality of academic work have shown that ChatGPT quality scores are at least as effective as citation counts in most fields and substantially better in a few (Thelwall & Yaghi, 2024). Medicine is an exception, however, with ChatGPT research quality scores having a small negative correlation with the mean scores of the submitting department in the Research Excellence Framework (REF) Clinical Medicine Unit of Assessment (UoA) (Thelwall, 2024ab; Thelwall & Yaghi, 2024).
Assessing the societal influence of academic research with ChatGPT: Impact case study evaluations
Kousha, Kayvan, Thelwall, Mike
Academics and departments are sometimes judged by how their research has benefitted society. For example, the UK Research Excellence Framework (REF) assesses Impact Case Studies (ICS), which are five-page evidence-based claims of societal impacts. This study investigates whether ChatGPT can evaluate societal impact claims and therefore potentially support expert human assessors. For this, various parts of 6,220 public ICS from REF2021 were fed to ChatGPT 4o-mini along with the REF2021 evaluation guidelines, comparing the results with published departmental average ICS scores. The results suggest that the optimal strategy for high correlations with expert scores is to input the title and summary of an ICS but not the remaining text, and to modify the original REF guidelines to encourage a stricter evaluation. The scores generated by this approach correlated positively with departmental average scores in all 34 Units of Assessment (UoAs), with values between 0.18 (Economics and Econometrics) and 0.56 (Psychology, Psychiatry and Neuroscience). At the departmental level, the corresponding correlations were higher, reaching 0.71 for Sport and Exercise Sciences, Leisure and Tourism. Thus, ChatGPT-based ICS evaluations are simple and viable to support or cross-check expert judgments, although their value varies substantially between fields.
Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs
Evaluating the quality of academic journal articles is a time-consuming but critical task for national research evaluation exercises, appointments and promotion. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process. This article assesses which ChatGPT inputs (full text without tables, figures and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts. The results show that the optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66) and 4o-mini (0.66). The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones. Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores into the human scale scores, which is 31% more accurate than guessing.
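The final calibration step the abstract mentions, converting model scores to the human scale by linear regression, can be sketched as a plain ordinary-least-squares fit. This is an illustrative reconstruction under assumed function names, not the authors' code.

```python
def fit_linear(model_scores, human_scores):
    """Ordinary least squares for human ~ a * model + b; returns (a, b)."""
    n = len(model_scores)
    mx = sum(model_scores) / n
    my = sum(human_scores) / n
    sxx = sum((x - mx) ** 2 for x in model_scores)
    sxy = sum((x - mx) * (y - my) for x, y in zip(model_scores, human_scores))
    a = sxy / sxx
    b = my - a * mx
    return a, b

def to_human_scale(model_score, a, b):
    """Convert a new model score onto the human score scale."""
    return a * model_score + b
```

Fitting on a sample of papers with known human scores and then applying the line to new papers corrects for the model systematically scoring higher or lower (or on a narrower range) than human assessors.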
The Impact of AI on Academic Research and Publishing
Lund, Brady, Lamba, Manika, Oh, Sang Hoo
Keywords: Artificial Intelligence, Large Language Models, Academic Research, Publishing Ethics, Scholarly Publishing
Generative artificial intelligence (AI) technologies like ChatGPT have significantly impacted academic writing and publishing through their ability to generate content at levels comparable to or surpassing human writers. Through a review of recent interdisciplinary literature, this paper examines ethical considerations surrounding the integration of AI into academia, focusing on the potential for this technology to be used for scholarly misconduct and the oversight necessary when using it for the writing, editing, and reviewing of scholarly papers. The findings highlight the need for collaborative approaches to AI usage among publishers, editors, reviewers, and authors to ensure that this technology is used ethically and productively.
From the introduction: Generative artificial intelligence technologies have rapidly transformed our daily lives, with one of the most profound impacts observed in the realm of writing. These models can produce content at a level that either matches or surpasses the quality of an average human writer. This transformation holds particular significance in academia, where faculty members are traditionally expected to engage in extensive scholarly writing. The increasing prevalence of generative artificial intelligence in academia raises substantial ethical concerns.
AI system not yet ready to help peer reviewers assess research quality
[Image caption: Artificial intelligence could eventually help to award scores to the tens of thousands of papers submitted to the Research Excellence Framework by UK universities. Credit: Yuichiro Chino/Getty]
Researchers tasked with examining whether artificial intelligence (AI) technology could assist in the peer review of journal articles submitted to the United Kingdom's Research Excellence Framework (REF) say the system is not yet accurate enough to aid human assessment, and recommend further testing in a large-scale pilot scheme. The team's findings, published on 12 December, show that the AI system generated identical scores to human peer reviewers up to 72% of the time. When averaged out over the multiple submissions made by some institutions across a broad range of the 34 subject-based 'units of assessment' that make up the REF, "the correlation between the human score and the AI score was very high", says data scientist Mike Thelwall at the University of Wolverhampton, UK, who is a co-author of the report. In its current form, however, the tool is most useful when assessing research output from institutions that submit a lot of articles to the REF, Thelwall says. It is less useful for smaller universities that submit only a handful of articles.