Yuksekgonul, Mert
SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning
Zhao, Wanjia, Yuksekgonul, Mert, Wu, Shirley, Zou, James
Multi-agent AI systems powered by large language models (LLMs) are increasingly applied to solve complex tasks. However, these systems often rely on fragile, manually designed prompts and heuristics, making optimization difficult. A key challenge in optimizing multi-agent systems is acquiring suitable training data for specialized agents. We introduce SiriuS, a self-improving, reasoning-driven optimization framework for multi-agent systems. Central to our approach is the construction of an experience library: a repository of high-quality reasoning trajectories. The library is built by retaining reasoning steps that lead to successful outcomes, providing a robust training set for optimizing multi-agent systems. Additionally, we introduce a library augmentation procedure that refines unsuccessful trajectories, further enriching the library. SiriuS boosts performance by 2.86\% to 21.88\% on reasoning and biomedical QA and enhances agent negotiation in competitive settings. Our results show that SiriuS enhances multi-agent performance while generating reusable data for future self-correction and self-play enhancement.
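To make the experience-library construction concrete, here is a minimal sketch of the retain-successful/refine-unsuccessful loop described above; the `run_agents`, `is_correct`, and `refine` callables are hypothetical placeholders, not the SiriuS implementation.

```python
# Minimal sketch of an "experience library" loop (hypothetical interface, not the SiriuS code).
from typing import Callable, Dict, List

def build_experience_library(
    tasks: List[str],
    run_agents: Callable[[str], Dict],        # runs the multi-agent system; returns {"trajectory", "answer"}
    is_correct: Callable[[str, str], bool],   # task-level success check
    refine: Callable[[Dict], Dict],           # augmentation step that repairs a failed trajectory
) -> List[Dict]:
    library = []
    for task in tasks:
        result = run_agents(task)
        if is_correct(task, result["answer"]):
            # Retain reasoning steps that led to a successful outcome.
            library.append({"task": task, "trajectory": result["trajectory"]})
        else:
            # Library augmentation: refine the unsuccessful trajectory before storing it.
            refined = refine(result)
            if is_correct(task, refined["answer"]):
                library.append({"task": task, "trajectory": refined["trajectory"]})
    return library  # later used as fine-tuning data for the specialized agents
```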
A ghost mechanism: An analytical model of abrupt learning
Dinc, Fatih, Cirakman, Ege, Jiang, Yiqi, Yuksekgonul, Mert, Schnitzer, Mark J., Tanaka, Hidenori
\emph{Abrupt learning} is commonly observed in neural networks, where long plateaus in network performance are followed by rapid convergence to a desirable solution. Yet, despite its common occurrence, the complex interplay of task, network architecture, and learning rule has made it difficult to understand the underlying mechanisms. Here, we introduce a minimal dynamical system trained on a delayed-activation task and demonstrate analytically how even a one-dimensional system can exhibit abrupt learning through ghost points rather than bifurcations. Through our toy model, we show that the emergence of a ghost point destabilizes learning dynamics. We identify a critical learning rate that prevents learning through two distinct loss landscape features: a no-learning zone and an oscillatory minimum. Testing these predictions in recurrent neural networks (RNNs), we confirm that ghost points precede abrupt learning and accompany the destabilization of learning. We demonstrate two complementary remedies: lowering the model output confidence prevents the network from getting stuck in no-learning zones, while increasing trainable ranks beyond task requirements (\textit{i.e.}, adding sloppy parameters) provides more stable learning trajectories. Our model reveals a bifurcation-free mechanism for abrupt learning and illustrates the importance of both deliberate uncertainty and redundancy in stabilizing learning dynamics.
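For readers unfamiliar with the term, a textbook one-dimensional example (not the paper's delayed-activation model) illustrates how a ghost point produces a plateau without any fixed point or bifurcation being involved: the system
\[
\dot{x} = \varepsilon + x^{2}, \qquad \varepsilon > 0,
\]
has no fixed points, yet trajectories linger near $x \approx 0$, the ``ghost'' of the fixed points that exist for $\varepsilon < 0$, for a time that scales as $\pi/\sqrt{\varepsilon}$ before escaping rapidly; the long dwell followed by sudden escape mirrors the plateau-then-jump signature of abrupt learning.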
TextGrad: Automatic "Differentiation" via Text
Yuksekgonul, Mert, Bianchi, Federico, Boen, Joseph, Liu, Sheng, Huang, Zhi, Guestrin, Carlos, Zou, James
AI is undergoing a paradigm shift, with breakthroughs achieved by systems orchestrating multiple large language models (LLMs) and other complex components. As a result, developing principled and automated optimization methods for compound AI systems is one of the most important new challenges. Neural networks faced a similar challenge in their early days, until backpropagation and automatic differentiation transformed the field by making optimization turn-key. Inspired by this, we introduce TextGrad, a powerful framework performing automatic ``differentiation'' via text. TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system. In our framework, LLMs provide rich, general, natural language suggestions to optimize variables in computation graphs, ranging from code snippets to molecular structures. TextGrad follows PyTorch's syntax and abstraction and is flexible and easy to use. It works out-of-the-box for a variety of tasks, where users only provide the objective function without tuning components or prompts of the framework. We showcase TextGrad's effectiveness and generality across a diverse range of applications, from question answering and molecule optimization to radiotherapy treatment planning. Without modifying the framework, TextGrad improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from $51\%$ to $55\%$, yields a $20\%$ relative performance gain in optimizing LeetCode-Hard coding problem solutions, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity. TextGrad lays a foundation to accelerate the development of the next generation of AI systems.
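The PyTorch-like abstraction can be sketched roughly as follows; this follows my reading of the open-source textgrad package's quickstart, so treat the exact class names and signatures (`tg.Variable`, `tg.TextLoss`, `tg.TGD`, `tg.BlackboxLLM`) as an approximation rather than authoritative documentation.

```python
# Rough TextGrad-style optimization loop; exact API details are an assumption.
import textgrad as tg

tg.set_backward_engine("gpt-4o")                 # LLM that generates the textual "gradients"
model = tg.BlackboxLLM("gpt-4o")                 # component whose output we want to improve

question = tg.Variable(
    "What is the time complexity of binary search?",
    role_description="question to the LLM",
    requires_grad=False,                         # inputs are not optimized
)
answer = model(question)
answer.set_role_description("concise and accurate answer to the question")

loss_fn = tg.TextLoss("Evaluate the answer for correctness, clarity, and brevity.")
optimizer = tg.TGD(parameters=[answer])          # Textual Gradient Descent over the answer

loss = loss_fn(answer)                           # natural-language criticism of the answer
loss.backward()                                  # backpropagate the textual feedback
optimizer.step()                                 # rewrite `answer` using the feedback
```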
How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis
Bianchi, Federico, Chia, Patrick John, Yuksekgonul, Mert, Tagliabue, Jacopo, Jurafsky, Dan, Zou, James
Negotiation is the basis of social interactions; humans negotiate everything from the price of cars to how to share common resources. With rapidly growing interest in using large language models (LLMs) to act as agents on behalf of human users, such LLM agents would also need to be able to negotiate. In this paper, we study how well LLMs can negotiate with each other. We develop NegotiationArena: a flexible framework for evaluating and probing the negotiation abilities of LLM agents. We implemented three types of scenarios in NegotiationArena to assess LLM agents' behavior in allocating shared resources (ultimatum games), aggregating resources (trading games), and buying/selling goods (price negotiations). Each scenario allows multiple turns of flexible dialogue between LLM agents, enabling more complex negotiations. Interestingly, LLM agents can significantly boost their negotiation outcomes by employing certain behavioral tactics. For example, by pretending to be desolate and desperate, LLMs can improve their payoffs by 20\% when negotiating against the standard GPT-4. We also quantify irrational negotiation behaviors exhibited by the LLM agents, many of which also appear in humans. Together, NegotiationArena offers a new environment to investigate LLM interactions, enabling new insights into LLMs' theory of mind, irrationality, and reasoning abilities.
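A schematic of the multi-turn negotiation loop the platform evaluates; `llm_respond` and the bracketed terminal moves are hypothetical placeholders, not the NegotiationArena API.

```python
# Illustrative two-agent, multi-turn negotiation loop (not the NegotiationArena API).
def negotiate(buyer_system: str, seller_system: str, llm_respond, max_turns: int = 6):
    transcript = []                                   # shared dialogue history
    roles = [("seller", seller_system), ("buyer", buyer_system)]
    for turn in range(max_turns):
        role, system_prompt = roles[turn % 2]         # agents alternate turns
        message = llm_respond(system_prompt, transcript)
        transcript.append((role, message))
        if "[ACCEPT]" in message or "[WALK AWAY]" in message:  # assumed terminal moves
            break
    return transcript
```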
ChatGPT Exhibits Gender and Racial Biases in Acute Coronary Syndrome Management
Zhang, Angela, Yuksekgonul, Mert, Guild, Joshua, Zou, James, Wu, Joseph C.
Recent breakthroughs in large language models (LLMs) have led to their rapid dissemination and widespread use. One early application has been to medicine, where LLMs have been investigated to streamline clinical workflows and facilitate clinical analysis and decision-making. However, a leading barrier to the deployment of artificial intelligence (AI), and LLMs in particular, has been concern about embedded gender and racial biases. Here, we evaluate whether a leading LLM, ChatGPT 3.5, exhibits gender and racial bias in clinical management of acute coronary syndrome (ACS). We find that specifying patients as female, African American, or Hispanic resulted in a decrease in guideline-recommended medical management, diagnosis, and symptom management of ACS. Most notably, the largest disparities were seen in the recommendation of coronary angiography or stress testing for the diagnosis and further intervention of ACS, and in the recommendation of high-intensity statins. These disparities correlate with biases that have been observed clinically and have been implicated in the differential gender and racial morbidity and mortality outcomes of ACS and coronary artery disease. Furthermore, we find that the largest disparities appear in cases of unstable angina, where fewer explicit clinical guidelines exist. Finally, we find that asking ChatGPT 3.5 to explain its reasoning before providing an answer improves clinical accuracy and mitigates instances of gender and racial bias. This is among the first studies to demonstrate that the gender and racial biases LLMs exhibit do in fact affect clinical management. Additionally, we demonstrate that existing strategies for improving LLM performance not only help in clinical management but can also be used to mitigate gender and racial biases.
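The mitigation mentioned at the end, eliciting reasoning before the recommendation, amounts to a small prompt change; the wording below is a hypothetical illustration, not the study's actual instrument.

```python
# Reasoning-before-answer prompt wrapper (hypothetical wording, not the study's instrument).
def reasoning_first_prompt(vignette: str, question: str) -> str:
    return (
        f"Clinical vignette:\n{vignette}\n\n"
        f"Question: {question}\n\n"
        "First, explain your clinical reasoning step by step, citing the relevant guideline "
        "considerations. Only after the explanation, state your final recommendation."
    )
```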
Beyond Confidence: Reliable Models Should Also Consider Atypicality
Yuksekgonul, Mert, Zhang, Linjun, Zou, James, Guestrin, Carlos
While most machine learning models can provide confidence in their predictions, confidence is insufficient to understand a prediction's reliability. For instance, a model may produce a low-confidence prediction either because the input is not well-represented in the training dataset or because the input is inherently ambiguous, and confidence alone cannot distinguish these cases. In this work, we investigate the relationship between how atypical (rare) a sample or a class is and the reliability of a model's predictions. We first demonstrate that atypicality is strongly related to miscalibration and accuracy. In particular, we empirically show that predictions for atypical inputs or atypical classes are more overconfident and have lower accuracy. Using these insights, we show that incorporating atypicality improves uncertainty quantification and model performance for discriminative neural networks and large language models. In a case study, we show that using atypicality improves the performance of a skin lesion classifier across different skin tone groups without access to the group attributes. Overall, we propose that models should use not only confidence but also atypicality to improve uncertainty quantification and performance. Our results demonstrate that simple post-hoc atypicality estimators can provide significant value.
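One simple post-hoc atypicality estimator in the spirit of the abstract is distance to the training set in feature space; this is an illustrative proxy, not necessarily the paper's estimator.

```python
# Post-hoc atypicality via k-NN distance in feature space (illustrative proxy).
import numpy as np
from sklearn.neighbors import NearestNeighbors

class AtypicalityEstimator:
    def __init__(self, k: int = 10):
        self.nn = NearestNeighbors(n_neighbors=k)

    def fit(self, train_features: np.ndarray) -> "AtypicalityEstimator":
        self.nn.fit(train_features)
        return self

    def score(self, features: np.ndarray) -> np.ndarray:
        # Larger mean k-NN distance => rarer (more atypical) input.
        dists, _ = self.nn.kneighbors(features)
        return dists.mean(axis=1)

# Usage idea: bin validation points by atypicality and recalibrate each bin separately
# (e.g., a per-bin temperature), since atypical inputs tend to be more overconfident.
```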
KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval
Abdin, Marah I, Gunasekar, Suriya, Chandrasekaran, Varun, Li, Jerry, Yuksekgonul, Mert, Peshawaria, Rahee Ghosh, Naik, Ranjita, Nushi, Besmira
We study the ability of state-of-the-art models to answer constraint satisfaction queries for information retrieval (e.g., 'a list of ice cream shops in San Diego'). In the past, such queries were considered tasks that could only be solved via web search or knowledge bases. More recently, large language models (LLMs) have demonstrated initial emergent abilities in this task. However, many current retrieval benchmarks are either saturated or do not measure constraint satisfaction. Motivated by rising concerns around factual incorrectness and hallucinations of LLMs, we present KITAB, a new dataset for measuring constraint satisfaction abilities of language models. KITAB consists of book-related data across more than 600 authors and 13,000 queries, and also offers an associated dynamic data collection and constraint verification approach for acquiring similar test data for other authors. Our extended experiments on GPT-4 and GPT-3.5 characterize and decouple common failure modes across dimensions such as information popularity, constraint types, and context availability. Results show that in the absence of context, models exhibit severe limitations as measured by irrelevant information, factual errors, and incompleteness, many of which worsen as information popularity decreases. While context availability mitigates irrelevant information, it does not help with satisfying constraints, pointing to fundamental barriers to constraint satisfaction. We open-source our contributions to foster further research on improving the constraint satisfaction abilities of future models.
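The constraint-verification idea can be illustrated with a toy checker that compares a model's book list against ground truth; KITAB's actual constraints and data-cleaning pipeline are richer than this sketch.

```python
# Toy constraint verification for a book-list query (illustrative, not the KITAB pipeline).
def verify_book_list(model_books, ground_truth_titles, constraint):
    truth = {t.lower() for t in ground_truth_titles}
    returned = {b.lower() for b in model_books}
    satisfied, irrelevant = [], []
    for title in model_books:
        if title.lower() not in truth:
            irrelevant.append(title)     # not actually by this author: factual error
        elif constraint(title):
            satisfied.append(title)      # by the author and meets the constraint
    missing = [t for t in ground_truth_titles
               if constraint(t) and t.lower() not in returned]  # incompleteness
    return {"satisfied": satisfied, "irrelevant": irrelevant, "missing": missing}

# Example constraint: books whose title starts with the letter 'a'.
report = verify_book_list(
    ["Adventures in Code", "Borrowed Time"],
    ["Adventures in Code", "Algorithms at Dawn", "Borrowed Time"],
    constraint=lambda title: title.lower().startswith("a"),
)
```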
Diversity of Thought Improves Reasoning Abilities of Large Language Models
Naik, Ranjita, Chandrasekaran, Varun, Yuksekgonul, Mert, Palangi, Hamid, Nushi, Besmira
Large language models (LLMs) are documented to struggle in settings that require complex reasoning. Nevertheless, instructing the model to break down the problem into smaller reasoning steps (Wei et al., 2022), or ensembling various generations by modifying decoding steps (Wang et al., 2023), boosts performance. Current methods assume that the input prompt is fixed and expect the decoding strategies to introduce the diversity needed for ensembling. In this work, we relax this assumption and discuss how one can create and leverage variations of the input prompt as a means of introducing diversity of thought and improving model performance. We propose a method that automatically improves prompt diversity by soliciting feedback from the LLM to ideate approaches that fit the problem. We then ensemble the diverse prompts in our method DIV-SE (DIVerse reasoning path Self-Ensemble) across multiple inference calls. We also propose a cost-effective alternative where diverse prompts are used within a single inference call; we call this IDIV-SE (In-call DIVerse reasoning path Self-Ensemble). Under a fixed generation budget, DIV-SE and IDIV-SE outperform the previously discussed baselines using both GPT-3.5 and GPT-4 on several reasoning benchmarks, without modifying the decoding process. Additionally, DIV-SE advances state-of-the-art performance on recent planning benchmarks (Valmeekam et al., 2023), exceeding the highest previously reported accuracy by at least 29.6 percentage points on the most challenging 4/5 Blocksworld task. Our results shed light on how to enforce prompt diversity for LLM reasoning and thereby improve the Pareto frontier of the accuracy-cost trade-off.
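A compact sketch of a DIV-SE-style self-ensemble over diverse prompts; the `llm` and `extract_answer` callables are hypothetical placeholders rather than the paper's code.

```python
# Diverse-prompt self-ensemble with majority vote (illustrative, not the paper's code).
from collections import Counter

def diverse_prompt_ensemble(question, prompt_styles, llm, extract_answer):
    answers = []
    for style in prompt_styles:        # e.g., distinct personas or solution approaches
        prompt = f"{style}\n\nProblem: {question}\nAnswer:"
        answers.append(extract_answer(llm(prompt)))
    # Aggregate the diverse reasoning paths by majority vote.
    return Counter(answers).most_common(1)[0][0]
```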
Holistic Evaluation of Language Models
Liang, Percy, Bommasani, Rishi, Lee, Tony, Tsipras, Dimitris, Soylu, Dilara, Yasunaga, Michihiro, Zhang, Yian, Narayanan, Deepak, Wu, Yuhuai, Kumar, Ananya, Newman, Benjamin, Yuan, Binhang, Yan, Bobby, Zhang, Ce, Cosgrove, Christian, Manning, Christopher D., Ré, Christopher, Acosta-Navas, Diana, Hudson, Drew A., Zelikman, Eric, Durmus, Esin, Ladhak, Faisal, Rong, Frieda, Ren, Hongyu, Yao, Huaxiu, Wang, Jue, Santhanam, Keshav, Orr, Laurel, Zheng, Lucia, Yuksekgonul, Mert, Suzgun, Mirac, Kim, Nathan, Guha, Neel, Chatterji, Niladri, Khattab, Omar, Henderson, Peter, Huang, Qian, Chi, Ryan, Xie, Sang Michael, Santurkar, Shibani, Ganguli, Surya, Hashimoto, Tatsunori, Icard, Thomas, Zhang, Tianyi, Chaudhary, Vishrav, Wang, William, Li, Xuechen, Mai, Yifan, Zhang, Yuhui, Koreeda, Yuta
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall by the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
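The multi-metric approach can be pictured as filling a scenario-by-metric matrix; the code below is a schematic of that idea, not the HELM toolkit's API.

```python
# Schematic scenario x metric evaluation matrix (not the HELM toolkit's API).
def evaluate_model(model, scenarios, metrics):
    results = {}
    for scenario in scenarios:                  # e.g., a QA or summarization use case
        predictions = [model(x) for x in scenario.inputs]
        results[scenario.name] = {
            metric.name: metric(predictions, scenario.references)
            for metric in metrics               # accuracy, calibration, robustness, ...
        }
    return results                              # one row per scenario, one column per metric
```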
Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models
Yuksekgonul, Mert, Chandrasekaran, Varun, Jones, Erik, Gunasekar, Suriya, Naik, Ranjita, Palangi, Hamid, Kamar, Ece, Nushi, Besmira
We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text. We propose modeling factual queries as Constraint Satisfaction Problems and use this framework to investigate how the model interacts internally with factual constraints. Specifically, we discover a strong positive relation between the model's attention to constraint tokens and the factual accuracy of its responses. In our curated suite of 11 datasets with over 40,000 prompts, we study the task of predicting factual errors with the Llama-2 family across all scales (7B, 13B, 70B). We propose SAT Probe, a method that probes self-attention patterns to predict constraint satisfaction and factual errors, enabling early error identification. The approach and findings demonstrate how a mechanistic understanding of factuality in LLMs can enhance reliability.
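A sketch of the probing idea: aggregate each layer/head's attention mass on the constraint tokens into a feature vector and fit a linear probe to predict factual correctness. The attention extraction is assumed to happen elsewhere; details differ from the paper's SAT Probe.

```python
# Attention-to-constraint features plus a linear probe (illustrative, not the paper's exact method).
import numpy as np
from sklearn.linear_model import LogisticRegression

def constraint_attention_features(attentions: np.ndarray, constraint_token_ids) -> np.ndarray:
    # attentions: (layers, heads, seq_len, seq_len) attention weights for one prompt.
    to_constraints = attentions[..., constraint_token_ids].sum(axis=-1)  # (layers, heads, seq_len)
    return to_constraints.max(axis=-1).reshape(-1)                       # one feature per layer/head

def fit_factual_error_probe(feature_matrix: np.ndarray, is_correct: np.ndarray) -> LogisticRegression:
    # Linear probe predicting whether the generation satisfied the factual constraint.
    return LogisticRegression(max_iter=1000).fit(feature_matrix, is_correct)
```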