judgement
- North America > United States > New Hampshire (0.05)
- North America > United States > Virginia (0.04)
- North America > United States > Massachusetts (0.04)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Vision (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.31)
Can LLMs Evaluate What They Cannot Annotate? Revisiting LLM Reliability in Hate Speech Detection
Piot, Paloma, Otero, David, Martín-Rodilla, Patricia, Parapar, Javier
Hate speech spreads widely online, harming individuals and communities, making automatic detection essential for large-scale moderation, yet detecting it remains difficult. Part of the challenge lies in subjectivity: what one person flags as hate speech, another may see as benign. Traditional annotation agreement metrics, such as Cohen's $κ$, oversimplify this disagreement, treating it as an error rather than meaningful diversity. Meanwhile, Large Language Models (LLMs) promise scalable annotation, but prior studies demonstrate that they cannot fully replace human judgement, especially in subjective tasks. In this work, we reexamine LLM reliability using a subjectivity-aware framework, cross-Rater Reliability (xRR), revealing that even under a fairer lens, LLMs still diverge from humans. Yet this limitation opens an opportunity: we find that LLM-generated annotations can reliably reflect performance trends across classification models, correlating with human evaluations. We test this by examining whether LLM-generated annotations preserve the relative ordering of model performance derived from human evaluation (i.e., whether models ranked as more reliable by human annotators keep the same order when evaluated with LLM-generated labels). Our results show that, although LLMs differ from humans at the instance level, they reproduce similar ranking and classification patterns, suggesting their potential as proxy evaluators. While not a substitute for human annotators, they might serve as a scalable proxy for evaluation in subjective NLP tasks.
- Europe > Austria > Vienna (0.14)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (11 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
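The ranking check described in the hate speech abstract above (do models keep the same order when scored against LLM-generated labels instead of human labels?) can be illustrated with a small sketch. This is not the authors' code: the labels, the model names, the use of accuracy as the score, and the choice of Kendall's tau as the rank correlation are all assumptions made for illustration, and the abstract does not say which metric or correlation the paper uses.

```python
from itertools import combinations


def accuracy(predictions, labels):
    """Fraction of instances where the prediction matches the reference label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def kendall_tau(xs, ys):
    """Kendall tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy data, invented for illustration: 1 = hateful, 0 = not hateful.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
llm_labels = [1, 0, 1, 1, 0, 1, 1, 1]  # differs from the human labels on two instances

# Hypothetical classifiers under evaluation.
model_predictions = {
    "model_a": [1, 0, 1, 1, 0, 0, 1, 0],
    "model_b": [1, 0, 1, 0, 0, 0, 0, 0],
    "model_c": [0, 1, 0, 1, 0, 1, 0, 0],
}

# Score every model twice: once against human labels, once against LLM labels.
names = sorted(model_predictions)
human_scores = [accuracy(model_predictions[n], human_labels) for n in names]
llm_scores = [accuracy(model_predictions[n], llm_labels) for n in names]

# Instance-level agreement between the two label sets, and ranking agreement.
instance_agreement = accuracy(llm_labels, human_labels)
tau = kendall_tau(human_scores, llm_scores)

print("instance-level agreement:", instance_agreement)
print("Kendall tau between model rankings:", tau)
```

In this toy setup the LLM labels disagree with the human labels on some instances, yet the three models are ordered the same way under both label sets, which is the pattern the abstract describes. With real data one would likely prefer a task-appropriate metric such as macro F1 and a tie-aware correlation.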
Artificial Intelligence Applications in Horizon Scanning for Infectious Diseases
Miles, Ian, Wakimoto, Mayumi, Meira, Wagner Jr., Paula, Daniela, Ticiane, Daylene, Rosa, Bruno, Biddulph, Jane, Georgiou, Stelios, Ermida, Valdir
This review explores the integration of Artificial Intelligence into Horizon Scanning, focusing on identifying and responding to emerging threats and opportunities linked to Infectious Diseases. We examine how AI tools can enhance signal detection, data monitoring, scenario analysis, and decision support. We also address the risks associated with AI adoption and propose strategies for effective implementation and governance. The findings contribute to the growing body of Foresight literature by demonstrating the potential and limitations of AI in Public Health preparedness.
- Asia > Japan > Kyūshū & Okinawa > Kyūshū > Kumamoto Prefecture > Kumamoto (0.04)
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
- South America > Brazil > Minas Gerais (0.04)
- (6 more...)
- Overview (1.00)
- Research Report (0.82)
- Education (0.68)
- Health & Medicine (0.46)
Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection
Mahdavi, Sadegh, Kisacanin, Branislav, Toshniwal, Shubham, Du, Wei, Moshkov, Ivan, Armstrong, George, Liao, Renjie, Thrampoulidis, Christos, Gitman, Igor
Large language models have achieved remarkable success on final-answer mathematical problems, largely due to the ease of applying reinforcement learning with verifiable rewards. However, the reasoning underlying these solutions is often flawed. Advancing to rigorous proof-based mathematics requires reliable proof verification capabilities. We begin by analyzing multiple evaluation setups and show that focusing on a single benchmark can lead to brittle or misleading conclusions. To address this, we evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance. We then scale two major generative verification methods (GenSelect and LLM-as-a-Judge) to millions of tokens and identify their combination as the most effective framework for solution verification and selection. We further show that the choice of prompt for LLM-as-a-Judge significantly affects the model's performance, but reinforcement learning can reduce this sensitivity. However, despite improving proof-level metrics, reinforcement learning does not enhance final-answer precision, indicating that current models often reward stylistic or procedural correctness rather than mathematical validity. Our results establish practical guidelines for designing and evaluating scalable proof-verification and selection systems.
- Europe > Austria > Vienna (0.14)
- North America > Canada > British Columbia (0.04)
- Europe > Serbia (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > New Hampshire (0.04)
- North America > United States > Virginia (0.04)
- (2 more...)
We disagree on the judgement with our highest respect, due to the nontrivial technical differences and results
We thank all reviewers for their helpful and constructive comments. We'll further improve the paper in the final version. In particular, our contributions are: (1) We introduce generalization bounds of learning algorithms on various losses, i.e. [...] Besides, it's nontrivial to analyze the relationship between HL and RL, especially for the second inequality. We'll add the discussions in the final version. We'll make the comparison and statements [...] Below, we discuss the pros and cons of each one in detail.
- Education (0.68)
- Health & Medicine (0.46)