human judge
AI is coming to Olympic judging: what makes it a game changer?
As the International Olympic Committee (IOC) embraces AI-assisted judging, this technology promises greater consistency and improved transparency. Yet research suggests that trust, legitimacy, and cultural values may matter just as much as technical accuracy. In 2024, the IOC unveiled its Olympic AI Agenda, positioning artificial intelligence as a central pillar of future Olympic Games. This vision was reinforced at the very first Olympic AI Forum, held in November 2025, where athletes, federations, technology partners, and policymakers discussed how AI could support judging, athlete preparation, and the fan experience.
- South America > Chile (0.05)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.05)
- Asia > Middle East > Jordan (0.05)
- Research Report (0.55)
- Personal (0.48)
Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks
Haldar, Rajarshi, Hockenmaier, Julia
As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become increasingly difficult. Lately, using large language models (LLMs) to evaluate these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, and makes it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and examine whether, with proper guidelines, LLM judges can still be used judiciously.
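To make the measurement concrete, here is a minimal sketch of how run-to-run score spread could be quantified, assuming a hypothetical judge_score call in place of a real LLM API; the toy noise merely stands in for sampling variance between runs.

```python
import random
import statistics

def judge_score(item: str) -> int:
    """Hypothetical stand-in for an LLM-judge call returning a 1-5 score.
    A real implementation would send `item` plus a fixed rubric prompt to
    the model; the random noise mimics run-to-run sampling variance."""
    base = 3 + (len(item) % 2)                      # toy "quality" signal
    return max(1, min(5, base + random.choice([-1, 0, 0, 1])))

def mean_intra_rater_spread(items, n_runs=5):
    """Score each item n_runs times; the mean per-item standard deviation
    is a simple proxy for the intra-rater inconsistency the paper measures."""
    spreads = [statistics.stdev([judge_score(x) for _ in range(n_runs)])
               for x in items]
    return statistics.mean(spreads)

print(mean_intra_rater_spread(["summary A", "summary B", "summary C"]))
```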
- Europe > Austria > Vienna (0.14)
- Europe > France (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- (9 more...)
- Health & Medicine (0.67)
- Leisure & Entertainment (0.46)
- North America > United States > Maryland (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
LLM-as-a-Fuzzy-Judge: Fine-Tuning Large Language Models as a Clinical Evaluation Judge with Fuzzy Logic
Zheng, Weibing, Turner, Laurah, Kropczynski, Jess, Ozer, Murat, Nguyen, Tri, Halse, Shane
Clinical communication skills are critical in medical education, yet practicing and assessing them at scale is challenging. Although LLM-powered clinical scenario simulations have shown promise in enhancing medical students' clinical practice, providing automated, scalable clinical evaluation that follows nuanced physician judgment is difficult. This paper combines fuzzy logic with large language models (LLMs) and proposes LLM-as-a-Fuzzy-Judge to address the challenge of aligning automated evaluation of medical students' clinical skills with subjective physician preferences. LLM-as-a-Fuzzy-Judge is an approach in which an LLM is fine-tuned to evaluate medical students' utterances within student-AI patient conversation scripts, based on human annotations across four fuzzy sets: Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction. The methodology proceeds from data collection in the LLM-powered medical education system, through data annotation based on multidimensional fuzzy sets, to prompt engineering and supervised fine-tuning (SFT) of pre-trained LLMs on these human annotations. The results show that LLM-as-a-Fuzzy-Judge achieves over 80% accuracy, with over 90% on major criterion items, effectively leveraging fuzzy logic and LLMs to deliver interpretable, human-aligned assessment. This work demonstrates the viability of combining fuzzy logic and LLMs to align with human preferences, advances automated evaluation in medical education, and supports more robust assessment and judgment practices. The GitHub repository for this work is available at https://github.com/2sigmaEdTech/LLMAsAJudge
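As a rough illustration of the fuzzy-set side, here is a minimal sketch that aggregates membership degrees over the four criteria named in the abstract; the equal weighting and the complement treatment of Contextual Distraction are assumptions for illustration, not details taken from the paper.

```python
# Sketch only: aggregation rule and weights are assumed, not the paper's.
CRITERIA = ["Professionalism", "Medical Relevance",
            "Ethical Behavior", "Contextual Distraction"]

def fuzzy_aggregate(memberships: dict[str, float]) -> float:
    """Combine per-criterion membership degrees in [0, 1] into one score.
    Contextual Distraction is a negative criterion, so its complement is
    used (an illustrative assumption)."""
    positive = [memberships[c] for c in CRITERIA[:3]]
    distraction = memberships["Contextual Distraction"]
    return (sum(positive) + (1.0 - distraction)) / 4.0

example = {"Professionalism": 0.9, "Medical Relevance": 0.8,
           "Ethical Behavior": 1.0, "Contextual Distraction": 0.2}
print(f"aggregate membership: {fuzzy_aggregate(example):.2f}")  # 0.88
```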
- South America > Argentina > Patagonia > Río Negro Province > Viedma (0.04)
- North America > United States > Ohio > Hamilton County > Cincinnati (0.04)
- Asia > Singapore (0.04)
- Asia > Middle East > Israel (0.04)
- Health & Medicine (1.00)
- Education > Educational Setting (0.78)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Enhancing Selection of Climate Tech Startups with AI -- A Case Study on Integrating Human and AI Evaluations in the ClimaTech Great Global Innovation Challenge
Turliuk, Jennifer, Sevilla, Alejandro, Gorza, Daniela, Hynes, Tod
This case study examines the ClimaTech Great Global Innovation Challenge's approach to selecting climate tech startups by integrating human and AI evaluations. The competition aimed to identify top startups and to make the selection process more accurate and efficient through a hybrid model. Research shows that data-driven approaches help VC firms reduce bias and improve decision-making, and machine learning models have outperformed human investors in deal screening, helping identify high-potential startups. Incorporating AI was intended to ensure more equitable and objective evaluations. The methodology included three phases: an initial AI review, semi-finals judged by humans, and finals using a hybrid weighting. In phase one, 57 applications were scored by an AI tool built with StackAI and OpenAI's GPT-4o, and the top 36 advanced. In the semi-finals, human judges, unaware of the AI scores, evaluated startups on team quality, market potential, and technological innovation. Each score, human or AI, was weighted equally, so with three human judges the result reflected 75 percent human and 25 percent AI influence. In the finals, with five human judges, the weighting shifted to 83.3 percent human and 16.7 percent AI. There was a moderate positive correlation between AI and human scores (Spearman's ρ = 0.47), indicating general alignment with some key differences. Notably, the final four startups, selected mainly by humans, were among those rated highest by the AI, highlighting the complementary nature of AI and human judgment. The study shows that hybrid models can streamline and improve startup assessments, and the ClimaTech approach offers a strong framework for future competitions combining human expertise with AI capabilities.
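The weighting arithmetic is easy to verify. Below is a small sketch, with illustrative scores, showing that equal per-judge weighting with one AI reviewer yields the quoted 75/25 and 83.3/16.7 splits.

```python
# With k human judges and one AI reviewer all weighted equally, the human
# share of the final score is k / (k + 1). Scores below are illustrative.

def hybrid_score(human_scores: list[float], ai_score: float) -> float:
    """Equal per-judge weighting: the AI counts as one extra judge."""
    return (sum(human_scores) + ai_score) / (len(human_scores) + 1)

def human_influence(n_humans: int) -> float:
    return n_humans / (n_humans + 1)

print(f"semi-finals: {human_influence(3):.1%} human")  # 75.0% with 3 judges
print(f"finals:      {human_influence(5):.1%} human")  # 83.3% with 5 judges
print(hybrid_score([8.0, 7.5, 9.0], 6.0))              # 7.625
```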
- Banking & Finance > Trading (0.48)
- Banking & Finance > Capital Markets (0.34)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.55)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.55)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)
Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges
Thakur, Nandan, Pradeep, Ronak, Upadhyay, Shivani, Campos, Daniel, Craswell, Nick, Lin, Jimmy
Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support": whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessment from scratch and (2) manual assessment with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge agrees more closely with GPT-4o than with the original human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.
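For concreteness, here is a minimal sketch of the exact-match agreement statistic quoted above, using an assumed 0/1/2 encoding of the three-level support scale and made-up labels.

```python
# Sketch: share of items where human and GPT-4o support labels coincide.
# The 0/1/2 encoding and the label data are illustrative assumptions.

def exact_match_rate(human: list[int], llm: list[int]) -> float:
    """Fraction of items with identical labels (0 = no support,
    1 = partial support, 2 = full support -- assumed encoding)."""
    assert len(human) == len(llm)
    return sum(h == m for h, m in zip(human, llm)) / len(human)

human_labels = [2, 1, 2, 0, 2, 1, 2, 2, 0, 1]
gpt4o_labels = [2, 1, 1, 0, 2, 2, 2, 2, 0, 0]
print(f"agreement: {exact_match_rate(human_labels, gpt4o_labels):.0%}")  # 70%
```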
- Asia > Singapore (0.04)
- North America > United States > Maryland > Montgomery County > Gaithersburg (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- (6 more...)
- Media > Music (0.30)
- Leisure & Entertainment (0.30)
Can LLM Assist in the Evaluation of the Quality of Machine Learning Explanations?
Wang, Bo, Li, Yiqiao, Zhou, Jianlong, Chen, Fang
Explainable machine learning (XML) has recently emerged to demystify the mechanisms of machine learning (ML) systems by interpreting their 'black box' results. Despite the development of various explanation methods, determining the most suitable XML method for a specific ML context remains unclear, highlighting the need for effective evaluation of explanations. The evaluation capabilities of Transformer-based large language models (LLMs) present an opportunity to adopt LLM-as-a-Judge for assessing explanations. In this paper, we propose a workflow that integrates both LLM-based and human judges for evaluating explanations. We examine how LLM-based judges evaluate the quality of various explanation methods and compare their evaluation capabilities to those of human judges within an iris classification scenario, employing both subjective and objective metrics. We conclude that while LLM-based judges effectively assess the quality of explanations using subjective metrics, they are not yet sufficiently developed to replace human judges in this role.
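Here is a minimal sketch of what the LLM-judge step in such a workflow could look like for the iris scenario; call_llm is a hypothetical stand-in for any chat-completion client, and the prompt and rating scale are illustrative rather than the paper's.

```python
# Sketch only: the judge rates one feature-attribution explanation on a
# subjective clarity metric. `call_llm` is a hypothetical placeholder.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real client."""
    return "4"  # stubbed response so the sketch runs end to end

def judge_explanation(prediction: str, attribution: dict[str, float]) -> int:
    features = ", ".join(f"{k}={v:+.2f}" for k, v in attribution.items())
    prompt = (
        "Rate the clarity of this explanation on a 1-5 scale.\n"
        f"Prediction: {prediction}\n"
        f"Feature attributions: {features}\n"
        "Answer with a single digit."
    )
    return int(call_llm(prompt).strip())

attr = {"petal length": +0.62, "petal width": +0.31, "sepal width": -0.05}
print(judge_explanation("Iris-virginica", attr))  # 4 with the stub above
```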
- Oceania > Australia > New South Wales > Sydney (0.05)
- Asia > Middle East > Jordan (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
AdParaphrase: Paraphrase Dataset for Analyzing Linguistic Features toward Generating Attractive Ad Texts
Murakami, Soichiro, Zhang, Peinan, Kamigaito, Hidetaka, Takamura, Hiroya, Okumura, Manabu
Effective linguistic choices that attract potential customers play a crucial role in advertising success. This study explores the linguistic features of ad texts that influence human preferences. Although the creation of attractive ad texts is an active area of research, progress in understanding the specific linguistic features that affect attractiveness is hindered by several obstacles. First, human preferences are complex and influenced by multiple factors, including content, such as brand names, and linguistic style, making analysis challenging. Second, publicly available ad text datasets that include signals of human preference, such as ad performance metrics and human feedback reflecting people's interests, are lacking. To address these problems, we present AdParaphrase, a paraphrase dataset that contains human preferences for pairs of ad texts that are semantically equivalent but differ in wording and style. This dataset enables preference analysis that focuses on differences in linguistic features. Our analysis revealed that ad texts preferred by human judges exhibit higher fluency, greater length, more nouns, and more frequent use of bracket symbols. Furthermore, we demonstrate that an ad text-generation model that incorporates these findings significantly improves the attractiveness of a given text. The dataset is publicly available at: https://github.com/CyberAgentAILab/AdParaphrase.
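As a concrete illustration of the feature analysis, here is a minimal sketch computing three of the reported signals (length, noun count, bracket use) for a pair of paraphrased ad texts; the toy noun lexicon stands in for a real POS tagger, and fluency would additionally require a language-model score.

```python
# Sketch only: the tiny lexicon and example texts are illustrative; a real
# analysis would use a POS tagger and an LM-based fluency estimate.
NOUN_LIKE = {"sale", "shoes", "discount", "offer", "store"}  # toy lexicon

def ad_features(text: str) -> dict[str, int]:
    tokens = text.lower().replace("[", " [ ").replace("]", " ] ").split()
    return {
        "length": len(tokens),
        "nouns": sum(t in NOUN_LIKE for t in tokens),
        "brackets": tokens.count("[") + tokens.count("]"),
    }

a = "Big sale on running shoes"
b = "[Limited offer] Running shoes on sale at our store"
print(ad_features(a))  # {'length': 5, 'nouns': 2, 'brackets': 0}
print(ad_features(b))  # {'length': 11, 'nouns': 4, 'brackets': 2}
```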