Bhatt, Umang
When Should We Orchestrate Multiple Agents?
Bhatt, Umang, Kapoor, Sanyam, Upadhyay, Mihir, Sucholutsky, Ilia, Quinzan, Francesco, Collins, Katherine M., Weller, Adrian, Wilson, Andrew Gordon, Zafar, Muhammad Bilal
Strategies for orchestrating the interactions between multiple agents, both human and artificial, can wildly overestimate performance and underestimate the cost of orchestration. We design a framework to orchestrate agents under realistic conditions, such as inference costs or availability constraints. We show theoretically that orchestration is only effective if there are performance or cost differentials between agents. We then empirically demonstrate how orchestration between multiple agents can be helpful for selecting agents in a simulated environment, picking a learning strategy in the infamous Rogers' Paradox from social science, and outsourcing tasks to other agents during a question-answer task in a user study.
Faster, Cheaper, Better: Multi-Objective Hyperparameter Optimization for LLM and RAG Systems
Barker, Matthew, Bell, Andrew, Thomas, Evan, Carr, James, Andrews, Thomas, Bhatt, Umang
While Retrieval Augmented Generation (RAG) has emerged as a popular technique for improving Large Language Model (LLM) systems, it introduces a large number of choices, parameters and hyperparameters that must be made or tuned. This includes the LLM, embedding, and ranker models themselves, as well as hyperparameters governing individual RAG components. Yet, collectively optimizing the entire configuration in a RAG or LLM system remains under-explored, especially in multi-objective settings, due to intractably large solution spaces, noisy objective evaluations, and the high cost of evaluations. In this work, we introduce the first approach for multi-objective parameter optimization of cost, latency, safety and alignment over entire LLM and RAG systems. We find that Bayesian optimization methods significantly outperform baseline approaches, obtaining a superior Pareto front on two new RAG benchmark tasks. We conclude our work with important considerations for practitioners who are designing multi-objective RAG systems, highlighting nuances such as how optimal configurations may not generalize across tasks and objectives. Retrieval Augmented Generation (RAG) has emerged as a popular technique for improving the performance of Large Language Models (LLMs) on question-answering tasks over specific datasets. A benefit of using RAG pipelines is that they can often achieve high performance on specific tasks without the need for extensive alignment and fine-tuning (Gupta et al., 2024), a costly and time-consuming process. However, the end-to-end pipeline of a RAG system is dependent on many parameters that span different components (or modules) of the system, such as the choice of LLM, the embedding model used in retrieval, the number of chunks retrieved and hyperparameters governing a reranking model. Examples of choices, parameters, and hyperparameters that are often made or tuned when implementing a RAG pipeline are listed in Table 1.
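To make the notion of a Pareto front concrete, the following is a minimal sketch of selecting non-dominated configurations, assuming every objective is minimized. The function name `pareto_front` and the (cost, latency) numbers are hypothetical illustrations and are not taken from the paper, which uses Bayesian optimization to search for such configurations.

```python
from typing import List, Sequence


def pareto_front(points: Sequence[Sequence[float]]) -> List[int]:
    """Return indices of non-dominated points (all objectives minimized)."""
    front = []
    for i, p in enumerate(points):
        dominated = False
        for j, q in enumerate(points):
            if j == i:
                continue
            # q dominates p if it is no worse on every objective
            # and strictly better on at least one.
            no_worse = all(a <= b for a, b in zip(q, p))
            strictly_better = any(a < b for a, b in zip(q, p))
            if no_worse and strictly_better:
                dominated = True
                break
        if not dominated:
            front.append(i)
    return front


# Hypothetical (cost, latency) measurements for four configurations.
configs = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
front = pareto_front(configs)
print(front)
```

In this toy example the third configuration is worse than the second on both objectives, so it is excluded from the front; the remaining three each represent a different trade-off.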
Revisiting Rogers' Paradox in the Context of Human-AI Interaction
Collins, Katherine M., Bhatt, Umang, Sucholutsky, Ilia
Humans learn about the world, and how to act in the world, in many ways: from individually conducting experiments to observing and reproducing others' behavior. Different learning strategies come with different costs and likelihoods of successfully learning more about the world. The choice that any one individual makes of how to learn can have an impact on the collective understanding of a whole population if people learn from each other. Alan Rogers developed simulations of a population of agents to study these network phenomena where agents could individually or socially learn amidst a dynamic, uncertain world and uncovered a confusing result: the availability of cheap social learning yielded no benefit to population fitness over individual learning. This paradox spawned decades of work trying to understand and uncover factors that foster the relative benefit of social learning that centuries of human behavior suggest exists. What happens in such network models now that humans can socially learn from AI systems that are themselves socially learning from us? We revisit Rogers' Paradox in the context of human-AI interaction to probe a simplified network of humans and AI systems learning together about an uncertain world. We propose and examine the impact of several learning strategies on the quality of the equilibrium of a society's 'collective world model'. We consider strategies that can be undertaken by various stakeholders involved in a single human-AI interaction: human, AI model builder, and society or regulators around the interaction. We then consider possible negative feedback loops that may arise from humans learning socially from AI: that learning from the AI may impact our own ability to learn about the world. We close with open directions into studying networks of human and AI systems that can be explored in enriched versions of our simulation framework.
Modulating Language Model Experiences through Frictions
Collins, Katherine M., Chen, Valerie, Sucholutsky, Ilia, Kirk, Hannah Rose, Sadek, Malak, Sargeant, Holli, Talwalkar, Ameet, Weller, Adrian, Bhatt, Umang
Language models are transforming the ways that their users engage with the world. Despite impressive capabilities, over-consumption of language model outputs risks propagating unchecked errors in the short term and damaging human capabilities for critical thinking in the long term, particularly in knowledge-based tasks. How can we develop scaffolding around language models to curate more appropriate use? We propose selective frictions for language model experiences, inspired by behavioral science interventions, to dampen misuse. Frictions involve small modifications to a user's experience, e.g., the addition of a button impeding model access and reminding a user of their expertise relative to the model. Through a user study with real humans, we observe shifts in user behavior from the imposition of a friction over LLMs in the context of a multi-topic question-answering task as a representative task that people may use LLMs for, e.g., in education and information retrieval. We find that frictions modulate over-reliance by driving down users' click rates while minimally affecting accuracy for those topics. Yet, frictions may have unintended effects. We find marked differences in users' click behaviors even on topics where frictions were not provisioned. Our contributions motivate further study of human-AI behavioral interaction to inform more effective and appropriate LLM use.
Large Language Models Must Be Taught to Know What They Don't Know
Kapoor, Sanyam, Gruver, Nate, Roberts, Manley, Collins, Katherine, Pal, Arka, Bhatt, Umang, Weller, Adrian, Dooley, Samuel, Goldblum, Micah, Wilson, Andrew Gordon
When using large language models (LLMs) in high-stakes applications, we need to know when we can trust their predictions. Some works argue that prompting high-performance LLMs is sufficient to produce calibrated uncertainties, while others introduce sampling methods that can be prohibitively expensive. In this work, we first argue that prompting on its own is insufficient to achieve good calibration and then show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead. We show that a thousand graded examples are sufficient to outperform baseline methods and that training through the features of a model is necessary for good performance and tractable for large open-source models when using LoRA. We also investigate the mechanisms that enable reliable LLM uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators, applicable not just to their own uncertainties but also the uncertainty of other models. Lastly, we show that uncertainty estimates inform human use of LLMs in human-AI collaborative settings through a user study.
Representational Alignment Supports Effective Machine Teaching
Sucholutsky, Ilia, Collins, Katherine M., Malaviya, Maya, Jacoby, Nori, Liu, Weiyang, Sumers, Theodore R., Korakakis, Michalis, Bhatt, Umang, Ho, Mark, Tenenbaum, Joshua B., Love, Brad, Pardos, Zachary A., Weller, Adrian, Griffiths, Thomas L.
A good teacher should not only be knowledgeable, but should also be able to communicate in a way that the student understands -- to share the student's representation of the world. In this work, we integrate insights from machine teaching and pragmatic communication with the burgeoning literature on representational alignment to characterize a utility curve defining a relationship between representational alignment and teacher capability for promoting student learning. To explore the characteristics of this utility curve, we design a supervised learning environment that disentangles representational alignment from teacher accuracy. We conduct extensive computational experiments with machines teaching machines, complemented by a series of experiments in which machines teach humans. Drawing on our findings that improved representational alignment with a student improves student learning outcomes (i.e., task accuracy), we design a classroom matching procedure that assigns students to teachers based on the utility curve. If we are to design effective machine teachers, it is not enough to build teachers that are accurate -- we want teachers that can align, representationally, to their students too.
When Should Algorithms Resign?
Bhatt, Umang, Sargeant, Holli
This paper discusses algorithmic resignation, a strategic approach for managing the use of AI systems within organizations. Algorithmic resignation involves the deliberate and informed disengagement from AI assistance in certain scenarios, by embedding governance mechanisms directly into AI systems. Our proposal is not merely about disuse of AI but includes guiding when and how these systems should be used or avoided. We discuss the multifaceted benefits of algorithmic resignation, spanning economic efficiency, reputational gains, and legal compliance. Further, we outline the operationalization of resignation through various methods such as positive and negative nudges, stakeholder incentive alignment, and careful consideration of the level of AI engagement. Using techniques like barring access to AI outputs selectively or providing explicit disclaimers on system performance, algorithmic resignation not only mitigates risks associated with AI but also leverages its benefits, ensuring the responsible and effective use of AI systems.
Comparing Abstraction in Humans and Large Language Models Using Multimodal Serial Reproduction
Kumar, Sreejan, Marjieh, Raja, Zhang, Byron, Campbell, Declan, Hu, Michael Y., Bhatt, Umang, Lake, Brenden, Griffiths, Thomas L.
Humans extract useful abstractions of the world from noisy sensory data. Serial reproduction allows us to study how people construe the world through a paradigm similar to the game of telephone, where one person observes a stimulus and reproduces it for the next to form a chain of reproductions. Past serial reproduction experiments typically employ a single sensory modality, but humans often communicate abstractions of the world to each other through language. To investigate the effect of language on the formation of abstractions, we implement a novel multimodal serial reproduction framework by asking people who receive a visual stimulus to reproduce it in a linguistic format, and vice versa. We ran unimodal and multimodal chains with both humans and GPT-4 and find that adding language as a modality has a larger effect on human reproductions than on GPT-4's. This suggests human visual and linguistic representations are more dissociable than those of GPT-4.
Evaluating Language Models for Mathematics through Interactions
Collins, Katherine M., Jiang, Albert Q., Frieder, Simon, Wong, Lionel, Zilka, Miri, Bhatt, Umang, Lukasiewicz, Thomas, Wu, Yuhuai, Tenenbaum, Joshua B., Hart, William, Gowers, Timothy, Li, Wenda, Weller, Adrian, Jamnik, Mateja
There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs, and is insufficient for making an informed decision about which LLMs, and under which assistive settings, can be sensibly used. Static assessment fails to account for the essential interactive element in LLM deployment, and therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analysing MathConverse, we derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, amongst other findings. Further, we garner a more granular understanding of GPT-4 mathematical problem-solving through a series of case studies, contributed by expert mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and are more interpretable and concise may constitute better assistants. Interactive evaluation is a promising way to navigate the capability of these models; humans should be aware of language models' algebraic fallibility and discern where they are appropriate to use.
Human-in-the-Loop Mixup
Collins, Katherine M., Bhatt, Umang, Liu, Weiyang, Piratla, Vihari, Sucholutsky, Ilia, Love, Bradley, Weller, Adrian
Aligning model representations to humans has been found to improve robustness and generalization. However, such methods often focus on standard observational data. Synthetic data is proliferating and powering many advances in machine learning; yet, it is not always clear whether synthetic labels are perceptually aligned to humans -- rendering it likely that model representations are not human-aligned. We focus on the synthetic data used in mixup: a powerful regularizer shown to improve model robustness, generalization, and calibration. We design a comprehensive series of elicitation interfaces, which we release as HILL MixE Suite, and recruit 159 participants to provide perceptual judgments, along with their uncertainties, over mixup examples. We find that human perceptions do not consistently align with the labels traditionally used for synthetic points, and begin to demonstrate the applicability of these findings to potentially increase the reliability of downstream models, particularly when incorporating human uncertainty. We release all elicited judgments in a new data hub we call H-Mix.
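For readers unfamiliar with mixup, the sketch below shows how the standard technique builds a synthetic point and its label as the same convex combination of two training examples. This is the conventional labeling that the abstract says human perception does not consistently match; it is not the paper's elicitation procedure. The helper names `mixup` and `sample_lambda`, the toy inputs, and the default `alpha` are assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_lambda(alpha: float = 1.0, rng: Optional[random.Random] = None) -> float:
    """Draw a mixing coefficient from Beta(alpha, alpha), as is common for mixup."""
    rng = rng or random.Random()
    return rng.betavariate(alpha, alpha)


def mixup(
    x_a: Sequence[float],
    y_a: Sequence[float],
    x_b: Sequence[float],
    y_b: Sequence[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Mix two examples: inputs and one-hot labels share the same coefficient."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


# Toy two-feature inputs with one-hot labels for two classes (hypothetical data).
x_mix, y_mix = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], lam=0.25)
print(x_mix, y_mix)

# In training, the coefficient is typically sampled rather than fixed.
lam = sample_lambda(alpha=1.0, rng=random.Random(0))
```

The synthetic label here is simply the mixing coefficient applied to the two original labels; a human-elicited label for the same mixed input could differ from this value.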