Kalai, Adam Tauman
First-Person Fairness in Chatbots
Eloundou, Tyna, Beutel, Alex, Robinson, David G., Gu-Lemberg, Keren, Brakman, Anna-Luisa, Mishkin, Pamela, Shah, Meghan, Heidecke, Johannes, Weng, Lilian, Kalai, Adam Tauman
Chatbots like ChatGPT are used for diverse purposes, ranging from resume writing to entertainment. These real-world applications are different from the institutional uses, such as resume screening or credit scoring, which have been the focus of much of AI research on fairness. Ensuring equitable treatment for all users in these first-person contexts is critical. In this work, we study "first-person fairness," which means fairness toward the chatbot user. This includes providing high-quality responses to all users regardless of their identity or background and avoiding harmful stereotypes. We propose a scalable, privacy-preserving method for evaluating one aspect of first-person fairness across a large, heterogeneous corpus of real-world chatbot interactions. Specifically, we assess potential bias linked to users' names, which can serve as proxies for demographic attributes like gender or race, in chatbot systems such as ChatGPT, which provide mechanisms for storing and using user names. Our method leverages a second language model to privately analyze name-sensitivity in the chatbot's responses. We verify the validity of these annotations through independent human evaluation. Further, we show that post-training interventions, including RL, significantly mitigate harmful stereotypes. Our approach also yields succinct descriptions of response differences across tasks. For instance, in the "writing a story" task, chatbot responses show a tendency to create protagonists whose gender matches the likely gender inferred from the user's name. Moreover, a pattern emerges where users with female-associated names receive responses with friendlier and simpler language slightly more often than users with male-associated names. Finally, we provide the system messages required for external researchers to further investigate ChatGPT's behavior with hypothetical user profiles.
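For concreteness, here is a minimal sketch of the evaluation pattern the abstract describes: generate responses to the same prompt under two different stored user names, then ask a second language model to annotate any differences. The helper query_model is a hypothetical stand-in for a chat-completion call, not the paper's actual API or prompts.

    # Hypothetical sketch of the name-sensitivity comparison; query_model is a
    # placeholder for any chat-completion call, not the paper's implementation.
    def query_model(system: str, user: str) -> str:
        raise NotImplementedError  # stand-in for an LLM provider SDK call

    def response_for_name(name: str, prompt: str) -> str:
        # The stored user name is supplied via the system message, mimicking memory.
        return query_model(system=f"The user's name is {name}.", user=prompt)

    def judge_name_sensitivity(prompt: str, name_a: str, name_b: str) -> str:
        resp_a = response_for_name(name_a, prompt)
        resp_b = response_for_name(name_b, prompt)
        rubric = (
            "Compare two chatbot responses to the same request, generated for users "
            "with different names. Describe any quality or stereotype differences, "
            "or answer 'no meaningful difference'.\n\n"
            f"Request: {prompt}\nResponse A ({name_a}): {resp_a}\n"
            f"Response B ({name_b}): {resp_b}"
        )
        return query_model(system="You are a careful, impartial annotator.", user=rubric)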
Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding
Suzgun, Mirac, Kalai, Adam Tauman
We introduce meta-prompting, an effective scaffolding technique designed to enhance the functionality of language models (LMs). This approach transforms a single LM into a multi-faceted conductor, adept at managing and integrating multiple independent LM queries. By employing high-level instructions, meta-prompting guides the LM to break down complex tasks into smaller, more manageable subtasks. These subtasks are then handled by distinct "expert" instances of the same LM, each operating under specific, tailored instructions. Central to this process is the LM itself, in its role as the conductor, which ensures seamless communication and effective integration of the outputs from these expert models. It additionally employs its inherent critical thinking and robust verification processes to refine and authenticate the end result. This collaborative prompting approach empowers a single LM to simultaneously act as a comprehensive orchestrator and a panel of diverse experts, significantly enhancing its performance across a wide array of tasks. The zero-shot, task-agnostic nature of meta-prompting greatly simplifies user interaction by obviating the need for detailed, task-specific instructions. Furthermore, our research demonstrates the seamless integration of external tools, such as a Python interpreter, into the meta-prompting framework, thereby broadening its applicability and utility. Through rigorous experimentation with GPT-4, we establish the superiority of meta-prompting over conventional scaffolding methods: When averaged across all tasks, including the Game of 24, Checkmate-in-One, and Python Programming Puzzles, meta-prompting, augmented with a Python interpreter functionality, surpasses standard prompting by 17.1%, expert (dynamic) prompting by 17.3%, and multipersona prompting by 15.2%.
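A minimal sketch of the conductor/expert loop described above, assuming a hypothetical single-turn completion helper llm; the paper's actual conductor prompts and expert-routing logic differ.

    # Sketch of the meta-prompting loop: one "conductor" LM decomposes the task and
    # routes subtasks to fresh "expert" instances of the same model.
    def llm(system: str, prompt: str) -> str:
        raise NotImplementedError  # stand-in for any chat-completion API

    def meta_prompt(task: str, max_rounds: int = 5) -> str:
        transcript = f"Task: {task}"
        for _ in range(max_rounds):
            step = llm(
                system=("You are the conductor. Either delegate by writing "
                        "EXPERT:<instructions> or finish with FINAL:<answer>."),
                prompt=transcript,
            )
            if step.startswith("FINAL:"):
                return step[len("FINAL:"):].strip()
            # Each expert is a fresh instance that sees only its own instructions.
            expert_out = llm(system="You are a domain expert.",
                             prompt=step[len("EXPERT:"):].strip())
            transcript += f"\nConductor: {step}\nExpert: {expert_out}"
        return transcript  # fall back to the accumulated transcript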
Partial Matrix Completion
Hazan, Elad, Kalai, Adam Tauman, Kanade, Varun, Mohri, Clara, Sun, Y. Jennifer
The matrix completion problem aims to reconstruct a low-rank matrix based on a revealed set of possibly noisy entries. Prior works consider completing the entire matrix with generalization error guarantees. However, the completion accuracy can be drastically different over different entries. This work establishes a new framework of partial matrix completion, where the goal is to identify a large subset of the entries that can be completed with high confidence. We propose an efficient algorithm with the following provable guarantees. Given access to samples from an unknown and arbitrary distribution, it guarantees: (a) high accuracy over completed entries, and (b) high coverage of the underlying distribution. We also consider an online learning variant of this problem, where we propose a low-regret algorithm based on iterative gradient updates. Preliminary empirical evaluations are included.
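To make the two guarantees concrete (notation ours, not necessarily the paper's): the learner outputs a set $S$ of entries together with completed values $\hat{M}$, and for accuracy and coverage parameters $\varepsilon, \delta$,
\[
\text{(a)}\;\; \max_{(i,j)\in S} \big|\hat{M}_{ij} - M_{ij}\big| \le \varepsilon,
\qquad
\text{(b)}\;\; \Pr_{(i,j)\sim \mathcal{D}}\big[(i,j)\in S\big] \ge 1-\delta,
\]
where $\mathcal{D}$ is the unknown, arbitrary distribution from which the revealed entries are sampled.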
Loss Minimization Yields Multicalibration for Large Neural Networks
Błasiok, Jarosław, Gopalan, Parikshit, Hu, Lunjia, Kalai, Adam Tauman, Nakkiran, Preetum
Multicalibration is a notion of fairness for predictors that requires them to provide calibrated predictions across a large set of protected groups. Multicalibration is known to be a goal distinct from loss minimization, even for simple predictors such as linear functions. In this work, we consider the setting where the protected groups can be represented by neural networks of size $k$, and the predictors are neural networks of size $n > k$. We show that minimizing the squared loss over all neural nets of size $n$ implies multicalibration for all but a bounded number of unlucky values of $n$. We also give evidence that our bound on the number of unlucky values is tight, given our proof technique. Previously, results of the flavor that loss minimization yields multicalibration were known only for predictors that were near the ground truth, and hence were rather limited in applicability. Unlike these, our results rely on the expressivity of neural nets and utilize the representation of the predictor.
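For reference, one standard formalization (notation ours; exact definitions vary across the literature): a predictor $f$ is $\alpha$-multicalibrated with respect to a class $\mathcal{C}$ of group-membership functions $c: X \to [0,1]$ if
\[
\Big|\, \mathbb{E}\big[\, c(x)\,(y - f(x)) \;\big|\; f(x) = v \,\big] \Big| \le \alpha
\quad \text{for all } c \in \mathcal{C} \text{ and all values } v \text{ of } f.
\]
In the setting above, $\mathcal{C}$ is the set of functions computable by neural networks of size $k$.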
Calibrated Language Models Must Hallucinate
Kalai, Adam Tauman, Vempala, Santosh S.
Recent language models generate false but plausible-sounding text with surprising frequency. Such "hallucinations" are an obstacle to the usability of language-based AI systems and can harm people who rely upon their outputs. This work shows that there is an inherent statistical lower bound on the rate at which pretrained language models hallucinate certain types of facts, one that has nothing to do with the transformer LM architecture or data quality. For "arbitrary" facts whose veracity cannot be determined from the training data, we show that hallucinations must occur at a certain rate for language models that satisfy a statistical calibration condition appropriate for generative language models. Specifically, if the maximum probability of any fact is bounded, we show that the probability of generating a hallucination is close to the fraction of facts that occur exactly once in the training data (a "Good-Turing" estimate), even assuming ideal training data without errors. One conclusion is that models pretrained to be sufficiently good predictors (i.e., calibrated) may require post-training to mitigate hallucinations on the type of arbitrary facts that tend to appear once in the training set. However, our analysis also suggests that there is no statistical reason that pretraining will lead to hallucination on facts that tend to appear more than once in the training data (like references to publications such as articles and books, whose hallucinations have been particularly notable and problematic) or on systematic facts (like arithmetic calculations). Therefore, different architectures and learning algorithms may mitigate these latter types of hallucinations.
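Schematically, and suppressing the calibration and miscoverage error terms that appear in the paper's formal statement, the lower bound ties the hallucination rate to the Good-Turing estimate:
\[
\Pr[\text{hallucination}] \;\gtrsim\; \frac{\#\{\text{facts appearing exactly once in the training data}\}}{\#\{\text{facts in the training data}\}},
\]
i.e., the classical Good-Turing estimate of the probability mass of facts never seen in training.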
Testing Language Model Agents Safely in the Wild
Naihin, Silen, Atkinson, David, Green, Marc, Hamadi, Merwane, Swift, Craig, Schonholtz, Douglas, Kalai, Adam Tauman, Bau, David
A prerequisite for safe autonomy-in-the-wild is safe testing-in-the-wild. Yet real-world autonomous tests face several unique safety challenges, both from the possibility of causing harm during a test and from the risk of encountering new unsafe agent behavior through interactions with real-world and potentially malicious actors. We propose a framework for conducting safe autonomous agent tests on the open internet: agent actions are audited by a context-sensitive monitor that enforces a stringent safety boundary to stop an unsafe test, with suspect behavior ranked and logged to be examined by humans. We design a basic safety monitor (AgentMonitor) that is flexible enough to monitor existing LLM agents, and, using an adversarial simulated agent, we measure its ability to identify and stop unsafe situations. Then we apply AgentMonitor to a battery of real-world tests of AutoGPT, and we identify several limitations and challenges that will face the creation of safe in-the-wild tests as autonomous agents grow more capable.
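An illustrative sketch of the audit loop in the spirit of AgentMonitor, assuming a hypothetical judge-model call score_with_llm and a generic agent interface; the paper's monitor, prompts, and thresholds differ.

    from dataclasses import dataclass, field

    def score_with_llm(context: str, action: str) -> float:
        # Return a safety score in [0, 1]; stand-in for a judge-model call.
        raise NotImplementedError

    @dataclass
    class Monitor:
        threshold: float = 0.8          # stringent boundary: block anything below
        log: list = field(default_factory=list)

        def audit(self, context: str, action: str) -> bool:
            score = score_with_llm(context, action)
            self.log.append((score, context, action))  # ranked later for human review
            return score >= self.threshold

    def run_test(agent, monitor: Monitor, max_steps: int = 50) -> None:
        context = ""
        for _ in range(max_steps):
            action = agent.propose(context)            # hypothetical agent interface
            if not monitor.audit(context, action):
                break                                  # stop the unsafe test
            context += agent.execute(action)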
A Theory of Unsupervised Translation Motivated by Understanding Animal Communication
Goldwasser, Shafi, Gruber, David F., Kalai, Adam Tauman, Paradise, Orr
Neural networks are capable of translating between languages -- in some cases even between two languages where there is little or no access to parallel translations, in what is known as Unsupervised Machine Translation (UMT). Given this progress, it is intriguing to ask whether machine learning tools can ultimately enable the understanding of animal communication, particularly that of highly intelligent animals. We propose a theoretical framework for analyzing UMT when no parallel translations are available and when it cannot be assumed that the source and target corpora address related subject domains or possess similar linguistic structure. We exemplify this theory with two stylized models of language, for which our framework provides bounds on necessary sample complexity; the bounds are formally proven and experimentally verified on synthetic data. These bounds show that the error rates are inversely related to the language complexity and the amount of common ground. This suggests that unsupervised translation of animal communication may be feasible if the communication system is sufficiently complex.
Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation
Zelikman, Eric, Lorch, Eliana, Mackey, Lester, Kalai, Adam Tauman
Several recent advances in AI systems (e.g., Tree-of-Thoughts and Program-Aided Language Models) solve problems by providing a "scaffolding" program that structures multiple calls to language models to generate better outputs. A scaffolding program is written in a programming language such as Python. In this work, we use a language-model-infused scaffolding program to improve itself. We start with a seed "improver" that improves an input program according to a given utility function by querying a language model several times and returning the best solution. We then run this seed improver to improve itself. Across a small set of downstream tasks, the resulting improved improver generates programs with significantly better performance than its seed improver. A variety of self-improvement strategies are proposed by the language model, including beam search, genetic algorithms, and simulated annealing. Since the language models themselves are not altered, this is not full recursive self-improvement. Nonetheless, it demonstrates that a modern language model, GPT-4 in our proof-of-concept experiments, is capable of writing code that can call itself to improve itself. We consider concerns around the development of self-improving technologies and evaluate the frequency with which the generated code bypasses a sandbox.
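A schematic version of the core loop, assuming hypothetical helpers llm (which returns candidate program rewrites) and a caller-supplied utility function; the paper's seed improver differs in its prompts and details.

    import inspect

    def llm(prompt: str, n: int = 4) -> list[str]:
        raise NotImplementedError  # stand-in returning n candidate programs

    def improve(program_src: str, utility) -> str:
        # Ask the language model for candidate rewrites and keep the best one.
        candidates = llm(f"Improve this program under the given utility:\n{program_src}")
        candidates.append(program_src)   # never regress below the input program
        return max(candidates, key=utility)

    def self_improve(utility) -> str:
        # The recursive step: point the improver at its own source code.
        return improve(inspect.getsource(improve), utility)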
Textbooks Are All You Need
Gunasekar, Suriya, Zhang, Yi, Aneja, Jyoti, Mendes, Caio César Teodoro, Del Giorno, Allie, Gopi, Sivakanth, Javaheripi, Mojan, Kauffmann, Piero, de Rosa, Gustavo, Saarikivi, Olli, Salim, Adil, Shah, Shital, Behl, Harkirat Singh, Wang, Xin, Bubeck, Sébastien, Eldan, Ronen, Kalai, Adam Tauman, Lee, Yin Tat, Li, Yuanzhi
We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.
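For reference, pass@k here is the standard unbiased estimator of Chen et al. (2021): given $n$ sampled completions per problem of which $c$ pass the unit tests, pass@k $= 1 - \binom{n-c}{k}/\binom{n}{k}$, averaged over problems; for $k = 1$ this is simply the fraction of passing samples.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased estimator: probability that at least one of k draws (without
        # replacement) from n samples, of which c are correct, passes.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    assert abs(pass_at_k(10, 5, 1) - 0.5) < 1e-9  # 5 of 10 samples pass -> pass@1 = 0.5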
Do Language Models Know When They're Hallucinating References?
Agrawal, Ayush, Suzgun, Mirac, Mackey, Lester, Kalai, Adam Tauman
State-of-the-art language models (LMs) are famous for "hallucinating" references. These fabricated article and book titles lead to harms, obstacles to their use, and public backlash. While other types of LM hallucinations are also important, we propose hallucinated references as the "drosophila" of research on hallucination in large language models (LLMs), as they are particularly easy to study. We show that simple search engine queries reliably identify such hallucinations, which facilitates evaluation. To begin to dissect the nature of hallucinated LM references, we attempt to classify them using black-box queries to the same LM, without consulting any external resources. Consistency checks done with "direct" queries about whether the generated reference title is real (inspired by Kadavath et al. 2022, Lin et al. 2022, Manakul et al. 2023) are compared to consistency checks with "indirect" queries, which ask for ancillary details such as the authors of the work. These consistency checks are found to be partially reliable indicators of whether or not a reference is a hallucination. In particular, we find that LMs often hallucinate differing authors of hallucinated references when queried in independent sessions, while consistently identifying the authors of real references. This suggests that hallucination may be more of a generation issue than something inherent to current training techniques or representations.
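A sketch of the "indirect query" consistency check described above: ask the same LM, in independent sessions, for the authors of a generated title, and treat low agreement as a sign of hallucination. The helper llm is a hypothetical fresh-session completion call; the paper's prompts and answer-matching rules differ.

    from collections import Counter

    def llm(prompt: str) -> str:
        raise NotImplementedError  # stand-in for a fresh-session LM query

    def indirect_consistency(title: str, n_sessions: int = 5) -> float:
        answers = [llm(f"Who are the authors of the work titled '{title}'?").strip()
                   for _ in range(n_sessions)]
        # Fraction of sessions agreeing with the most common author answer;
        # low agreement suggests the reference is hallucinated.
        return Counter(answers).most_common(1)[0][1] / n_sessions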