On scalable oversight with weak LLMs judging strong LLMs

Neural Information Processing Systems

Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and a baseline of direct question-answering, where the judge answers outright without any AI. We use large language models (LLMs) both as AI agents and as stand-ins for human judges, taking the judge models to be weaker than the agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct or incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry, debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters and consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.
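The three protocols the abstract contrasts can be sketched in code. This is a toy simulation under invented assumptions, not the paper's setup: agent "arguments" are reduced to numeric persuasiveness scores, with correct answers easier to argue for, and the judge is a simple threshold rule. It illustrates why random assignment hurts consultancy (the consultant must sometimes sell the wrong answer) while debate lets the judge compare both sides.

```python
import random

# Hypothetical stand-ins for the LLM agents and judges in the abstract;
# a real experiment would replace these scores with model calls.
def stub_argument(answer, strength):
    """Toy 'argument': correct answers are easier to argue convincingly."""
    return strength if answer == "correct" else strength * 0.6

def debate_round(judge_threshold=0.0, strength=1.0):
    """Two debaters argue opposite answers; the judge compares both cases."""
    pro = stub_argument("correct", strength)
    con = stub_argument("incorrect", strength)
    return "correct" if pro - con > judge_threshold else "incorrect"

def consultancy_round(assigned_answer, strength=1.0):
    """One consultant argues an assigned answer; the judge accepts it
    if the argument clears a fixed bar, otherwise picks the other answer."""
    persuasiveness = stub_argument(assigned_answer, strength)
    if persuasiveness > 0.5:
        return assigned_answer
    return "correct" if assigned_answer == "incorrect" else "incorrect"

def run_trials(n=1000, seed=0):
    """Judge accuracy under debate vs. randomly-assigned consultancy."""
    rng = random.Random(seed)
    debate_acc = sum(debate_round() == "correct" for _ in range(n)) / n
    cons_acc = sum(
        consultancy_round(rng.choice(["correct", "incorrect"])) == "correct"
        for _ in range(n)) / n
    return debate_acc, cons_acc
```

In this caricature, debate accuracy is perfect while randomly-assigned consultancy hovers near chance, mirroring the qualitative finding that debate outperforms consultancy under random assignment.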


AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs

Carro, María Victoria, Mester, Denise Alejandra, Nieto, Facundo, Stanchi, Oscar Agustín, Bergman, Guido Ernesto, Leiva, Mario Alejandro, Sprejer, Eitan, Gangi, Luca Nicolás Forziati, Selasco, Francisca Gauna, Corvalán, Juan Gustavo, Simari, Gerardo I., Martinez, María Vanina

arXiv.org Artificial Intelligence

The core premise of AI debate as a scalable oversight technique is that it is harder to lie convincingly than to refute a lie, enabling the judge to identify the correct position. Yet, existing debate experiments have relied on datasets with ground truth, where lying is reduced to defending an incorrect proposition. This overlooks a subjective dimension: lying also requires the belief that the claim defended is false. In this work, we apply debate to subjective questions and explicitly measure large language models' prior beliefs before experiments. Debaters were asked to select their preferred position, then presented with a judge persona deliberately designed to conflict with their identified priors. This setup tested whether models would adopt sycophantic strategies, aligning with the judge's presumed perspective to maximize persuasiveness, or remain faithful to their prior beliefs. We implemented and compared two debate protocols, sequential and simultaneous, to evaluate potential systematic biases. Finally, we assessed whether models were more persuasive and produced higher-quality arguments when defending positions consistent with their prior beliefs versus when arguing against them. Our main findings show that models tend to prefer defending stances aligned with the judge persona rather than their prior beliefs, sequential debate introduces significant bias favoring the second debater, models are more persuasive when defending positions aligned with their prior beliefs, and paradoxically, arguments misaligned with prior beliefs are rated as higher quality in pairwise comparison. These results can inform human judges to provide higher-quality training signals and contribute to more aligned AI systems, while revealing important aspects of human-AI interaction regarding persuasion dynamics in language models.
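The sequential-versus-simultaneous distinction the abstract measures can be made concrete with a minimal sketch. All of the following is an illustrative assumption, not the authors' implementation: agents are functions returning a scalar argument quality, and seeing the opponent's argument grants a fixed rebuttal bonus, which is what produces the second-speaker bias the paper reports.

```python
# In a sequential protocol the second debater sees the first argument and
# can rebut it; in a simultaneous protocol both debaters argue blind.
def sequential_debate(first, second, judge):
    a1 = first(opponent_argument=None)
    a2 = second(opponent_argument=a1)   # second speaker can rebut
    return judge(a1, a2)

def simultaneous_debate(first, second, judge):
    a1 = first(opponent_argument=None)
    a2 = second(opponent_argument=None)  # no rebuttal advantage
    return judge(a1, a2)

def make_agent(base):
    """Toy debater: base argument quality plus a bonus when rebutting."""
    def agent(opponent_argument=None):
        bonus = 0.2 if opponent_argument is not None else 0.0
        return base + bonus
    return agent

def score_judge(a1, a2):
    """Toy judge: picks whichever argument scores strictly higher."""
    return "second" if a2 > a1 else "first"
```

With equally strong debaters, the sequential protocol hands the win to the second speaker purely through the rebuttal bonus, while the simultaneous protocol does not, which is the systematic bias the experiments are designed to detect.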






AI Is Learning to Do the Jobs of Doctors, Lawyers, and Consultants

TIME - Tech

RadVid-19, a program which identifies lung injuries through artificial intelligence, is used at the University of Sao Paulo in Brazil. The tasks resemble those that lawyers, doctors, financial analysts, and management consultants solve for a living. One asks for a diagnosis of a six-year-old patient based on nine pieces of multimedia evidence; another asks for legal advice on a musician's estate; a third calls for a valuation of part of a healthcare technology company. Mercor, which claims to supply "expert data" to every top AI company, says that it spent more than $500,000 to develop 200 tasks that test whether AIs can perform knowledge work with high economic value across law, medicine, finance, and management consulting.


MaRGen: Multi-Agent LLM Approach for Self-Directed Market Research and Analysis

Koshkin, Roman, Dai, Pengyu, Fujikawa, Nozomi, Togami, Masahito, Visentini-Scarzanella, Marco

arXiv.org Artificial Intelligence

We present an autonomous framework that leverages Large Language Models (LLMs) to automate end-to-end business analysis and market report generation. At its core, the system employs specialized agents - Researcher, Reviewer, Writer, and Retriever - that collaborate to analyze data and produce comprehensive reports. These agents learn from real professional consultants' presentation materials at Amazon through in-context learning to replicate professional analytical methodologies. The framework executes a multi-step process: querying databases, analyzing data, generating insights, creating visualizations, and composing market reports. We also introduce a novel LLM-based evaluation system for assessing report quality, which shows alignment with expert human evaluations. Building on these evaluations, we implement an iterative improvement mechanism that optimizes report quality through automated review cycles. Experimental results show that report quality can be improved by both automated review cycles and consultants' unstructured knowledge. In experimental validation, our framework generates detailed 6-page reports in 7 minutes at a cost of approximately $1. Our work could be an important step toward automatically creating affordable market insights.
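The automated review cycle the abstract describes can be sketched as a Writer-Reviewer loop. Everything below is a hypothetical reduction for illustration: the roles are plain functions, a report is a list of sections, and the reviewer's checks are invented stand-ins for the paper's LLM-based quality evaluation.

```python
def writer(draft, feedback):
    """Writer agent stub: revise the draft by addressing each issue."""
    return draft + [f"addressed: {item}" for item in feedback]

def reviewer(draft):
    """Reviewer agent stub: return issues; an empty list passes review."""
    issues = []
    if not any("visualization" in section for section in draft):
        issues.append("add a visualization")
    if len(draft) < 3:
        issues.append("expand the analysis")
    return issues

def review_cycle(draft, max_rounds=5):
    """Iterate writer revisions until the reviewer is satisfied
    or the round budget is exhausted."""
    for _ in range(max_rounds):
        feedback = reviewer(draft)
        if not feedback:
            break
        draft = writer(draft, feedback)
    return draft
```

The round budget matters in practice: it bounds cost per report (the paper quotes roughly $1 and 7 minutes per report), trading off against how many reviewer objections can be resolved.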


Confirmation bias: A challenge for scalable oversight

Recchia, Gabriel, Mangat, Chatrik Singh, Nyachhyon, Jinu, Sharma, Mridul, Canavan, Callum, Epstein-Gross, Dylan, Abdulbari, Muhammed

arXiv.org Artificial Intelligence

Scalable oversight protocols aim to empower evaluators to accurately verify AI models more capable than themselves. However, human evaluators are subject to biases that can lead to systematic errors. We conduct two studies examining the performance of simple oversight protocols where evaluators know that the model is "correct most of the time, but not all of the time". We find no overall advantage for the tested protocols, although in Study 1, showing arguments in favor of both answers improves accuracy in cases where the model is incorrect. In Study 2, participants in both groups become more confident in the system's answers after conducting online research, even when those answers are incorrect. We also reanalyze data from prior work that was more optimistic about simple protocols, finding that human evaluators possessing knowledge absent from models likely contributed to their positive results--an advantage that diminishes as models continue to scale in capability. These findings underscore the importance of testing the degree to which oversight protocols are robust to evaluator biases, whether they outperform simple deference to the model under evaluation, and whether their performance scales with increasing problem difficulty and model capability.


SMARTAPS: Tool-augmented LLMs for Operations Management

Yu, Timothy Tin Long, Mostajabdaveh, Mahdi, Byusa, Jabo Serge, Ramamonjison, Rindra, Carenini, Giuseppe, Mao, Kun, Zhou, Zirui, Zhang, Yong

arXiv.org Artificial Intelligence

Large language models (LLMs) present intriguing opportunities to enhance user interaction with traditional algorithms and tools in real-world applications. An advanced planning system (APS) is sophisticated software that leverages optimization to help operations planners create, interpret, and modify an operational plan. While highly beneficial, many customers are priced out of using an APS due to the ongoing costs of consultants responsible for customization and maintenance. To address the need for a more accessible APS expressed by supply chain planners, we present SmartAPS, a conversational system built on a tool-augmented LLM. Our system provides operations planners with an intuitive natural language chat interface, allowing them to query information, perform counterfactual reasoning, receive recommendations, and execute scenario analysis to better manage their operation. A short video demonstrating the system has been released: https://youtu.be/KtIrJjlDbyw
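The core of a tool-augmented LLM like the one described is a dispatch layer that routes the model's parsed tool calls to backend functions (plan queries, counterfactual re-optimization, and so on). The sketch below is an assumption-laden illustration: the tool names, return shapes, and the re-optimization stub are invented here and do not reflect SmartAPS's actual interface.

```python
# Backend stubs standing in for calls into the planning/optimization system.
def tool_query_plan(order_id):
    """Look up an order in the current operational plan (stubbed)."""
    return {"order": order_id, "ship_date": "2024-06-01"}

def tool_what_if(order_id, new_capacity):
    """Counterfactual check under a hypothetical capacity (stubbed)."""
    return {"order": order_id, "feasible": new_capacity >= 10}

TOOLS = {"query_plan": tool_query_plan, "what_if": tool_what_if}

def dispatch(tool_call):
    """Route a parsed (name, kwargs) tool call from the LLM to its
    implementation, rejecting tools the registry does not know."""
    name, kwargs = tool_call
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)
```

Keeping the registry explicit means the LLM can only trigger vetted operations, which is the usual safety argument for the tool-augmented pattern over letting the model emit arbitrary queries against the planner.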