human judge
AI is coming to Olympic judging: what makes it a game changer?
As the International Olympic Committee (IOC) embraces AI-assisted judging, this technology promises greater consistency and improved transparency. Yet research suggests that trust, legitimacy, and cultural values may matter just as much as technical accuracy. In 2024, the IOC unveiled its Olympic AI Agenda, positioning artificial intelligence as a central pillar of future Olympic Games. This vision was reinforced at the very first Olympic AI Forum, held in November 2025, where athletes, federations, technology partners, and policymakers discussed how AI could support judging, athlete preparation, and the fan experience.
- South America > Chile (0.05)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.05)
- Asia > Middle East > Jordan (0.05)
- Research Report (0.55)
- Personal (0.48)
Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks
Haldar, Rajarshi, Hockenmaier, Julia
As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become increasingly difficult. Lately, using large language models (LLMs) to evaluate these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, and makes it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and examine whether, with proper guidelines, LLM judges can still be used judiciously.
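To make the measurement concrete, here is a minimal sketch of how run-to-run score spread could be quantified, assuming a hypothetical judge_score call in place of a real LLM API; the toy noise merely stands in for sampling variance between runs.

```python
import random
import statistics

def judge_score(item: str) -> int:
    """Hypothetical stand-in for an LLM-judge call returning a 1-5 score.
    A real implementation would send `item` plus a fixed rubric prompt to
    the model; the random noise mimics run-to-run sampling variance."""
    base = 3 + (len(item) % 2)                      # toy "quality" signal
    return max(1, min(5, base + random.choice([-1, 0, 0, 1])))

def mean_intra_rater_spread(items, n_runs=5):
    """Score each item n_runs times; the mean per-item standard deviation
    is a simple proxy for the intra-rater inconsistency the paper measures."""
    spreads = [statistics.stdev([judge_score(x) for _ in range(n_runs)])
               for x in items]
    return statistics.mean(spreads)

print(mean_intra_rater_spread(["summary A", "summary B", "summary C"]))
```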
- Europe > Austria > Vienna (0.14)
- Europe > France (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- (9 more...)
- Health & Medicine (0.67)
- Leisure & Entertainment (0.46)
- North America > United States > Maryland (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
LLM-as-a-Fuzzy-Judge: Fine-Tuning Large Language Models as a Clinical Evaluation Judge with Fuzzy Logic
Zheng, Weibing, Turner, Laurah, Kropczynski, Jess, Ozer, Murat, Nguyen, Tri, Halse, Shane
Clinical communication skills are critical in medical education, yet practicing and assessing them at scale is challenging. Although LLM-powered clinical scenario simulations have shown promise in enhancing medical students' clinical practice, providing automated, scalable clinical evaluation that follows nuanced physician judgment is difficult. This paper combines fuzzy logic with large language models (LLMs) and proposes LLM-as-a-Fuzzy-Judge to address the challenge of aligning automated evaluation of medical students' clinical skills with subjective physician preferences. LLM-as-a-Fuzzy-Judge is an approach in which an LLM is fine-tuned to evaluate medical students' utterances within student-AI patient conversation scripts, based on human annotations across four fuzzy sets: Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction. The methodology proceeds from data collection in the LLM-powered medical education system, through data annotation based on multidimensional fuzzy sets, to prompt engineering and supervised fine-tuning (SFT) of pre-trained LLMs on these human annotations. The results show that LLM-as-a-Fuzzy-Judge achieves over 80% accuracy, with over 90% on major criterion items, effectively leveraging fuzzy logic and LLMs to deliver interpretable, human-aligned assessment. This work demonstrates the viability of combining fuzzy logic and LLMs to align with human preferences, advances automated evaluation in medical education, and supports more robust assessment and judgment practices. The GitHub repository for this work is available at https://github.com/2sigmaEdTech/LLMAsAJudge
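As a rough illustration of the fuzzy-set side, here is a minimal sketch that aggregates membership degrees over the four criteria named in the abstract; the equal weighting and the complement treatment of Contextual Distraction are assumptions for illustration, not details taken from the paper.

```python
# Sketch only: aggregation rule and weights are assumed, not the paper's.
CRITERIA = ["Professionalism", "Medical Relevance",
            "Ethical Behavior", "Contextual Distraction"]

def fuzzy_aggregate(memberships: dict[str, float]) -> float:
    """Combine per-criterion membership degrees in [0, 1] into one score.
    Contextual Distraction is a negative criterion, so its complement is
    used (an illustrative assumption)."""
    positive = [memberships[c] for c in CRITERIA[:3]]
    distraction = memberships["Contextual Distraction"]
    return (sum(positive) + (1.0 - distraction)) / 4.0

example = {"Professionalism": 0.9, "Medical Relevance": 0.8,
           "Ethical Behavior": 1.0, "Contextual Distraction": 0.2}
print(f"aggregate membership: {fuzzy_aggregate(example):.2f}")  # 0.88
```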
- South America > Argentina > Patagonia > Río Negro Province > Viedma (0.04)
- North America > United States > Ohio > Hamilton County > Cincinnati (0.04)
- Asia > Singapore (0.04)
- Asia > Middle East > Israel (0.04)
- Health & Medicine (1.00)
- Education > Educational Setting (0.78)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Enhancing Selection of Climate Tech Startups with AI -- A Case Study on Integrating Human and AI Evaluations in the ClimaTech Great Global Innovation Challenge
Turliuk, Jennifer, Sevilla, Alejandro, Gorza, Daniela, Hynes, Tod
This case study examines the ClimaTech Great Global Innovation Challenge's approach to selecting climate tech startups by integrating human and AI evaluations. The competition aimed to identify top startups and to make the selection process more accurate and efficient through a hybrid model. Research shows that data-driven approaches help VC firms reduce bias and improve decision-making, and machine learning models have outperformed human investors in deal screening, helping identify high-potential startups. Incorporating AI was intended to ensure more equitable and objective evaluations. The methodology included three phases: an initial AI review, semi-finals judged by humans, and finals using a hybrid weighting. In phase one, 57 applications were scored by an AI tool built with StackAI and OpenAI's GPT-4o, and the top 36 advanced. In the semi-finals, human judges, unaware of the AI scores, evaluated startups on team quality, market potential, and technological innovation. Each score, human or AI, was weighted equally, so with three human judges the result reflected 75 percent human and 25 percent AI influence. In the finals, with five human judges, the weighting shifted to 83.3 percent human and 16.7 percent AI. There was a moderate positive correlation between AI and human scores (Spearman's ρ = 0.47), indicating general alignment with some key differences. Notably, the final four startups, selected mainly by humans, were among those rated highest by the AI, highlighting the complementary nature of AI and human judgment. The study shows that hybrid models can streamline and improve startup assessments, and the ClimaTech approach offers a strong framework for future competitions combining human expertise with AI capabilities.
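The weighting arithmetic is easy to verify. Below is a small sketch, with illustrative scores, showing that equal per-judge weighting with one AI reviewer yields the quoted 75/25 and 83.3/16.7 splits.

```python
# With k human judges and one AI reviewer all weighted equally, the human
# share of the final score is k / (k + 1). Scores below are illustrative.

def hybrid_score(human_scores: list[float], ai_score: float) -> float:
    """Equal per-judge weighting: the AI counts as one extra judge."""
    return (sum(human_scores) + ai_score) / (len(human_scores) + 1)

def human_influence(n_humans: int) -> float:
    return n_humans / (n_humans + 1)

print(f"semi-finals: {human_influence(3):.1%} human")  # 75.0% with 3 judges
print(f"finals:      {human_influence(5):.1%} human")  # 83.3% with 5 judges
print(hybrid_score([8.0, 7.5, 9.0], 6.0))              # 7.625
```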
- Banking & Finance > Trading (0.48)
- Banking & Finance > Capital Markets (0.34)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.55)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.55)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)
Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges
Thakur, Nandan, Pradeep, Ronak, Upadhyay, Shivani, Campos, Daniel, Craswell, Nick, Lin, Jimmy
Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support": whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessment from scratch and (2) manual assessment with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge agrees more closely with GPT-4o than with the original human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.
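For concreteness, here is a minimal sketch of the exact-match agreement statistic quoted above, using an assumed 0/1/2 encoding of the three-level support scale and made-up labels.

```python
# Sketch: share of items where human and GPT-4o support labels coincide.
# The 0/1/2 encoding and the label data are illustrative assumptions.

def exact_match_rate(human: list[int], llm: list[int]) -> float:
    """Fraction of items with identical labels (0 = no support,
    1 = partial support, 2 = full support -- assumed encoding)."""
    assert len(human) == len(llm)
    return sum(h == m for h, m in zip(human, llm)) / len(human)

human_labels = [2, 1, 2, 0, 2, 1, 2, 2, 0, 1]
gpt4o_labels = [2, 1, 1, 0, 2, 2, 2, 2, 0, 0]
print(f"agreement: {exact_match_rate(human_labels, gpt4o_labels):.0%}")  # 70%
```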
- Asia > Singapore (0.04)
- North America > United States > Maryland > Montgomery County > Gaithersburg (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- (6 more...)
- Media > Music (0.30)
- Leisure & Entertainment (0.30)
Can LLM Assist in the Evaluation of the Quality of Machine Learning Explanations?
Wang, Bo, Li, Yiqiao, Zhou, Jianlong, Chen, Fang
Explainable machine learning (XML) has recently emerged to demystify the mechanisms of machine learning (ML) systems by interpreting their 'black box' results. Despite the development of various explanation methods, determining the most suitable XML method for a specific ML context remains unclear, highlighting the need for effective evaluation of explanations. The evaluation capabilities of Transformer-based large language models (LLMs) present an opportunity to adopt LLM-as-a-Judge for assessing explanations. In this paper, we propose a workflow that integrates both LLM-based and human judges for evaluating explanations. We examine how LLM-based judges evaluate the quality of various explanation methods and compare their evaluation capabilities to those of human judges within an iris classification scenario, employing both subjective and objective metrics. We conclude that while LLM-based judges effectively assess the quality of explanations using subjective metrics, they are not yet sufficiently developed to replace human judges in this role.
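Here is a minimal sketch of what the LLM-judge step in such a workflow could look like for the iris scenario; call_llm is a hypothetical stand-in for any chat-completion client, and the prompt and rating scale are illustrative rather than the paper's.

```python
# Sketch only: the judge rates one feature-attribution explanation on a
# subjective clarity metric. `call_llm` is a hypothetical placeholder.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real client."""
    return "4"  # stubbed response so the sketch runs end to end

def judge_explanation(prediction: str, attribution: dict[str, float]) -> int:
    features = ", ".join(f"{k}={v:+.2f}" for k, v in attribution.items())
    prompt = (
        "Rate the clarity of this explanation on a 1-5 scale.\n"
        f"Prediction: {prediction}\n"
        f"Feature attributions: {features}\n"
        "Answer with a single digit."
    )
    return int(call_llm(prompt).strip())

attr = {"petal length": +0.62, "petal width": +0.31, "sepal width": -0.05}
print(judge_explanation("Iris-virginica", attr))  # 4 with the stub above
```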
- Oceania > Australia > New South Wales > Sydney (0.05)
- Asia > Middle East > Jordan (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
AdParaphrase: Paraphrase Dataset for Analyzing Linguistic Features toward Generating Attractive Ad Texts
Murakami, Soichiro, Zhang, Peinan, Kamigaito, Hidetaka, Takamura, Hiroya, Okumura, Manabu
Effective linguistic choices that attract potential customers play a crucial role in advertising success. This study explores the linguistic features of ad texts that influence human preferences. Although the creation of attractive ad texts is an active area of research, progress in understanding the specific linguistic features that affect attractiveness is hindered by several obstacles. First, human preferences are complex and influenced by multiple factors, including content, such as brand names, and linguistic style, making analysis challenging. Second, publicly available ad text datasets that include signals of human preference, such as ad performance metrics and human feedback reflecting people's interests, are lacking. To address these problems, we present AdParaphrase, a paraphrase dataset that contains human preferences for pairs of ad texts that are semantically equivalent but differ in wording and style. This dataset enables preference analysis that focuses on differences in linguistic features. Our analysis revealed that ad texts preferred by human judges exhibit higher fluency, greater length, more nouns, and more frequent use of bracket symbols. Furthermore, we demonstrate that an ad text-generation model that incorporates these findings significantly improves the attractiveness of a given text. The dataset is publicly available at: https://github.com/CyberAgentAILab/AdParaphrase.
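As a concrete illustration of the feature analysis, here is a minimal sketch computing three of the reported signals (length, noun count, bracket use) for a pair of paraphrased ad texts; the toy noun lexicon stands in for a real POS tagger, and fluency would additionally require a language-model score.

```python
# Sketch only: the tiny lexicon and example texts are illustrative; a real
# analysis would use a POS tagger and an LM-based fluency estimate.
NOUN_LIKE = {"sale", "shoes", "discount", "offer", "store"}  # toy lexicon

def ad_features(text: str) -> dict[str, int]:
    tokens = text.lower().replace("[", " [ ").replace("]", " ] ").split()
    return {
        "length": len(tokens),
        "nouns": sum(t in NOUN_LIKE for t in tokens),
        "brackets": tokens.count("[") + tokens.count("]"),
    }

a = "Big sale on running shoes"
b = "[Limited offer] Running shoes on sale at our store"
print(ad_features(a))  # {'length': 5, 'nouns': 2, 'brackets': 0}
print(ad_features(b))  # {'length': 11, 'nouns': 4, 'brackets': 2}
```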