Who Determines What Is Relevant? Humans or AI? Why Not Both?

Communications of the ACM 

To measure progress on better methods for Web search, question answering, conversational agents, or retrieval from knowledge bases, it is essential to know which responses are relevant to a user's information need. Such judgments of what is relevant are traditionally obtained by asking human assessors. With the latest improvements in autoregressive large language models (LLMs) such as ChatGPT, researchers have started to experiment with the idea of replacing human relevance assessment with LLMs.9 The approach is simple: Just ask an LLM chatbot whether a response is relevant to an information need, and it will provide an "opinion." Recent empirical studies in Web search,3 but also in programming,7 human–computer interaction,5 and protein function prediction,10 have shown that LLM-generated opinions often agree with the assessments of humans.
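To make the "just ask an LLM" approach concrete, the sketch below shows one common way such a relevance judgment is set up: compose a yes/no prompt from the query and the candidate response, send it to a chatbot, and map the free-text answer back to a binary label. The prompt wording, function names, and parsing rule are illustrative assumptions, not taken from any of the cited studies; the actual LLM call is left as a stub.

```python
# Illustrative sketch of LLM-based relevance assessment.
# The prompt template and parse rule are assumptions for demonstration;
# real studies use carefully designed prompts and graded relevance scales.

def build_relevance_prompt(query: str, response: str) -> str:
    """Compose a binary relevance question for an LLM judge."""
    return (
        "You are a relevance assessor for a search engine.\n"
        f"Information need: {query}\n"
        f"Candidate response: {response}\n"
        "Is the response relevant to the information need? "
        "Answer with exactly 'relevant' or 'not relevant'."
    )

def parse_judgment(llm_answer: str) -> bool:
    """Map the model's free-text answer to a binary relevance label."""
    return llm_answer.strip().lower().startswith("relevant")

# Usage: the prompt would be sent to an LLM chatbot (call omitted here),
# and the returned text parsed into a label.
prompt = build_relevance_prompt(
    "capital of France",
    "Paris is the capital and largest city of France.",
)
label = parse_judgment("Relevant")  # stand-in for a real LLM reply
```

In practice, researchers aggregate many such judgments and compare them against human assessor labels, for example via inter-annotator agreement statistics such as Cohen's kappa.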