consultancy
On scalable oversight with weak LLMs judging strong LLMs
Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AI's compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions;and compare to a baseline of direct question-answering, where the judge just answers outright without the AI.We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed.Previous work assigned debaters/consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy.Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.
AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs
Carro, María Victoria, Mester, Denise Alejandra, Nieto, Facundo, Stanchi, Oscar Agustín, Bergman, Guido Ernesto, Leiva, Mario Alejandro, Sprejer, Eitan, Gangi, Luca Nicolás Forziati, Selasco, Francisca Gauna, Corvalán, Juan Gustavo, Simari, Gerardo I., Martinez, María Vanina
The core premise of AI debate as a scalable oversight technique is that it is harder to lie convincingly than to refute a lie, enabling the judge to identify the correct position. Yet, existing debate experiments have relied on datasets with ground truth, where lying is reduced to defending an incorrect proposition. This overlooks a subjective dimension: lying also requires the belief that the claim defended is false. In this work, we apply debate to subjective questions and explicitly measure large language models' prior beliefs before experiments. Debaters were asked to select their preferred position, then presented with a judge persona deliberately designed to conflict with their identified priors. This setup tested whether models would adopt sycophantic strategies, aligning with the judge's presumed perspective to maximize persuasiveness, or remain faithful to their prior beliefs. We implemented and compared two debate protocols, sequential and simultaneous, to evaluate potential systematic biases. Finally, we assessed whether models were more persuasive and produced higher-quality arguments when defending positions consistent with their prior beliefs versus when arguing against them. Our main findings show that models tend to prefer defending stances aligned with the judge persona rather than their prior beliefs, sequential debate introduces significant bias favoring the second debater, models are more persuasive when defending positions aligned with their prior beliefs, and paradoxically, arguments misaligned with prior beliefs are rated as higher quality in pairwise comparison. These results can inform human judges to provide higher-quality training signals and contribute to more aligned AI systems, while revealing important aspects of human-AI interaction regarding persuasion dynamics in language models.
- South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.04)
- North America > United States (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (4 more...)
- Health & Medicine (0.93)
- Education (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Belief Revision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Education (0.68)
- Banking & Finance (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Communications (0.67)
Tony Blair and Nick Clegg hosted dinner giving tech bosses access to UK minister
Blair (left) and Clegg hosted the private dinner at the five-star Corinthia hotel in London in January. Blair (left) and Clegg hosted the private dinner at the five-star Corinthia hotel in London in January. Exclusive: Six tech leaders dined with investment minister, documents reveal, underlining growing influence of ex-PM's consultancy Tony Blair and Nick Clegg hosted a private dinner earlier this year at which a select group of technology entrepreneurs were given access to a key minister, official documents have revealed. He and Clegg, the former deputy prime minister who at the time was a senior executive at Meta, invited leaders of six tech companies to dine with Poppy Gustafsson, who was the government's investment minister responsible for persuading firms to invest in Britain. Blair is an evangelical proponent of the revolutionary potential of technology to transform faltering public services and has long courted alliances with leaders in the industry.
- Europe > United Kingdom (1.00)
- Europe > Ukraine (0.06)
- Oceania > Australia (0.05)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Education (0.68)
- Banking & Finance (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Communications (0.67)
Confirmation bias: A challenge for scalable oversight
Recchia, Gabriel, Mangat, Chatrik Singh, Nyachhyon, Jinu, Sharma, Mridul, Canavan, Callum, Epstein-Gross, Dylan, Abdulbari, Muhammed
Scalable oversight protocols aim to empower evaluators to accurately verify AI models more capable than themselves. However, human evaluators are subject to biases that can lead to systematic errors. We conduct two studies examining the performance of simple oversight protocols where evaluators know that the model is "correct most of the time, but not all of the time". We find no overall advantage for the tested protocols, although in Study 1, showing arguments in favor of both answers improves accuracy in cases where the model is incorrect. In Study 2, participants in both groups become more confident in the system's answers after conducting online research, even when those answers are incorrect. We also reanalyze data from prior work that was more optimistic about simple protocols, finding that human evaluators possessing knowledge absent from models likely contributed to their positive results--an advantage that diminishes as models continue to scale in capability. These findings underscore the importance of testing the degree to which oversight protocols are robust to evaluator biases, whether they outperform simple deference to the model under evaluation, and whether their performance scales with increasing problem difficulty and model capability.
- Europe > United Kingdom (0.14)
- North America > United States (0.04)
- Europe > Poland > Lesser Poland Province > Kraków (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study > Negative Result (0.46)
- Health & Medicine (1.00)
- Education (1.00)
- Leisure & Entertainment > Sports > Tennis (0.46)
On scalable oversight with weak LLMs judging strong LLMs
Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AI's compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions;and compare to a baseline of direct question-answering, where the judge just answers outright without the AI.We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed.Previous work assigned debaters/consultants an answer to argue for.
Debating for Better Reasoning: An Unsupervised Multimodal Approach
Adhikari, Ashutosh, Lapata, Mirella
As Large Language Models (LLMs) gain expertise across diverse domains and modalities, scalable oversight becomes increasingly challenging, particularly when their capabilities may surpass human evaluators. Debate has emerged as a promising mechanism for enabling such oversight. In this work, we extend the debate paradigm to a multimodal setting, exploring its potential for weaker models to supervise and enhance the performance of stronger models. We focus on visual question answering (VQA), where two "sighted" expert vision-language models debate an answer, while a "blind" (text-only) judge adjudicates based solely on the quality of the arguments. In our framework, the experts defend only answers aligned with their beliefs, thereby obviating the need for explicit role-playing and concentrating the debate on instances of expert disagreement. Experiments on several multimodal tasks demonstrate that the debate framework consistently outperforms individual expert models. Moreover, judgments from weaker LLMs can help instill reasoning capabilities in vision-language models through finetuning.
- North America > United States > New Jersey (0.06)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (3 more...)
- Health & Medicine (0.70)
- Transportation (0.48)
- Education (0.46)
Training Language Models to Win Debates with Self-Play Improves Judge Accuracy
Arnesen, Samuel, Rein, David, Michael, Julian
We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play. In a long-context reading comprehension task, we find that language model based evaluators answer questions more accurately when judging models optimized to win debates. By contrast, we find no such relationship for consultancy models trained to persuade a judge without an opposing debater present. In quantitative and qualitative comparisons between our debate models and novel consultancy baselines, we find evidence that debate training encourages stronger and more informative arguments, showing promise that it can help provide high-quality supervision for tasks that are difficult to directly evaluate.
- North America > United States > New York (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Singapore (0.04)
- (2 more...)
On scalable oversight with weak LLMs judging strong LLMs
Kenton, Zachary, Siegel, Noah Y., Kramár, János, Brown-Cohen, Jonah, Albanie, Samuel, Bulian, Jannis, Agarwal, Rishabh, Lindner, David, Tang, Yunhao, Goodman, Noah D., Shah, Rohin
Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AI's compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare to a baseline of direct question-answering, where the judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters/consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)