Evaluation of Large Language Models via Coupled Token Generation
Benz, Nina Corvelo, Tsirtsis, Stratis, Straitouri, Eleni, Chatzi, Ivi, Velasco, Ander Artola, Thejaswi, Suhas, Gomez-Rodriguez, Manuel
State-of-the-art large language models rely on randomization to respond to a prompt. As an immediate consequence, a model may respond differently to the same prompt if asked multiple times. In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning. Our starting point is the development of a causal model for coupled autoregressive generation, which allows different large language models to sample responses with the same source of randomness. Building upon our causal model, we first show that, on evaluations based on benchmark datasets, coupled autoregressive generation leads to the same conclusions as vanilla autoregressive generation but using provably fewer samples. However, we further show that, on evaluations based on (human) pairwise comparisons, coupled and vanilla autoregressive generation can surprisingly lead to different rankings when comparing more than two models, even with an infinite number of samples. This suggests that the apparent advantage of a model over others in existing evaluation protocols may not be genuine but rather confounded by the randomness inherent to the generation process. To illustrate and complement our theoretical results, we conduct experiments with several large language models from the Llama family. We find that, across multiple knowledge areas from the popular MMLU benchmark dataset, coupled autoregressive generation requires up to 40% fewer samples to reach the same conclusions as vanilla autoregressive generation. Further, using data from the LMSYS Chatbot Arena platform, we find that the win rates derived from pairwise comparisons of responses to prompts, as judged by a strong large language model, differ under coupled and vanilla autoregressive generation.
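To make coupled autoregressive generation concrete, below is a minimal sketch (not the paper's implementation) of how two models can draw a token from the same source of randomness via the Gumbel-Max trick; the toy logits, vocabulary size, and function names are illustrative.

```python
import numpy as np

def gumbel_max_sample(logits, gumbel_noise):
    """Sample a token index via the Gumbel-Max trick using externally supplied noise."""
    return int(np.argmax(logits + gumbel_noise))

rng = np.random.default_rng(0)
vocab_size = 5

# Toy next-token logits for two hypothetical models responding to the same prompt.
logits_model_a = rng.normal(size=vocab_size)
logits_model_b = rng.normal(size=vocab_size)

# Coupled generation: both models consume the *same* Gumbel noise at this step,
# so each model's marginal distribution is unchanged but their samples are coupled.
shared_noise = rng.gumbel(size=vocab_size)
token_a_coupled = gumbel_max_sample(logits_model_a, shared_noise)
token_b_coupled = gumbel_max_sample(logits_model_b, shared_noise)

# Vanilla generation: each model draws its own independent noise.
token_a_vanilla = gumbel_max_sample(logits_model_a, rng.gumbel(size=vocab_size))
token_b_vanilla = gumbel_max_sample(logits_model_b, rng.gumbel(size=vocab_size))
```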
Counterfactual Token Generation in Large Language Models
Chatzi, Ivi, Benz, Nina Corvelo, Straitouri, Eleni, Tsirtsis, Stratis, Gomez-Rodriguez, Manuel
"Sure, I am happy to generate a story for you: Captain Lyra stood at the helm of her trusty ship, the Maelstrom's Fury, gazing out at the endless sea. [...] Lyra's eyes welled up with tears as she realized the bitter truth - she had sacrificed everything for fleeting riches, and lost the love of her crew, her family, and herself." Although this story, generated by a large language model, is captivating, one may wonder -- how would the story have unfolded if the model had chosen "Captain Maeve" as the protagonist instead? We cannot know. State-of-the-art large language models are stateless -- they maintain no internal memory or state. Given a prompt, they generate a sequence of tokens as an output using an autoregressive process. As a consequence, they cannot reason about counterfactual alternatives to tokens they have generated in the past. In this work, our goal is to enhance them with this functionality. To this end, we develop a causal model of token generation that builds upon the Gumbel-Max structural causal model. Our model allows any large language model to perform counterfactual token generation at almost no cost in comparison with vanilla token generation, it is embarrassingly simple to implement, and it does not require any fine-tuning nor prompt engineering. We implement our model on Llama 3 8B-Instruct and Ministral-8B-Instruct and conduct a qualitative and a quantitative analysis of counterfactually generated text. We conclude with a demonstrative application of counterfactual token generation for bias detection, unveiling interesting insights about the model of the world constructed by large language models.
Towards Human-AI Complementarity with Prediction Sets
De Toni, Giovanni, Okati, Nastaran, Thejaswi, Suhas, Straitouri, Eleni, Gomez-Rodriguez, Manuel
In recent years, there has been increasing excitement about the potential of decision support systems based on machine learning to help human experts make more accurate predictions in a variety of application domains, including medicine, education and science [1-3]. In this context, the ultimate goal is human-AI complementarity--the predictions made by the human expert who uses a decision support system are more accurate than both the predictions made by the expert on their own and those made by the classifier used by the decision support system [4-8]. The conventional wisdom is that, to achieve human-AI complementarity, decision support systems should help humans understand when and how to use their predictions to update their own. As a result, a flurry of empirical studies has analyzed how factors such as confidence, explanations, or calibration influence when and how humans use the predictions provided by a decision support system [9-12]. Unfortunately, these studies have so far been inconclusive, and it is still unclear how to design decision support systems that achieve human-AI complementarity [13-17]. Against this backdrop, Straitouri et al. [18, 19] have recently argued, both theoretically and empirically, that an alternative type of decision support system may achieve human-AI complementarity by design. Rather than providing a single label prediction and letting a human expert decide when and how to use the predicted label to update their own prediction, these systems provide a set of label predictions, namely a prediction set, and ask the expert to predict a label value from the set.
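As a rough illustration of this type of system, the sketch below builds a prediction set with a standard split conformal construction and would then ask the expert to pick a label from it; the procedure and numbers are illustrative and not necessarily the exact construction used in [18, 19].

```python
import numpy as np

def prediction_set(test_softmax, calibration_scores, alpha=0.1):
    """Return the labels whose softmax score exceeds a calibrated threshold
    (split conformal prediction with score 1 - softmax of the true label)."""
    n = len(calibration_scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    threshold = np.quantile(calibration_scores, q_level, method="higher")
    return [label for label, s in enumerate(test_softmax) if 1 - s <= threshold]

# Toy calibration scores (1 - softmax of the true label) and a new test prediction.
calibration_scores = np.array([0.05, 0.2, 0.6, 0.1, 0.5, 0.15, 0.7, 0.35, 0.12, 0.08])
test_softmax = np.array([0.55, 0.30, 0.10, 0.05])
print(prediction_set(test_softmax, calibration_scores))  # the expert picks one label from this set
```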
Matchings, Predictions and Counterfactual Harm in Refugee Resettlement Processes
Lee, Seungeon, Benz, Nina Corvelo, Thejaswi, Suhas, Gomez-Rodriguez, Manuel
Resettlement agencies have started to adopt data-driven algorithmic matching to match refugees to locations using employment rate as a measure of utility. Given a pool of refugees, data-driven algorithmic matching utilizes a classifier to predict the probability that each refugee would find employment at any given location. Then, it uses the predicted probabilities to estimate the expected utility of all possible placement decisions. Finally, it finds the placement decisions that maximize the predicted utility by solving a maximum weight bipartite matching problem. In this work, we argue that, using existing solutions, there may be pools of refugees for which data-driven algorithmic matching is (counterfactually) harmful -- it would have achieved lower utility than a given default policy used in the past, had it been used. Then, we develop a post-processing algorithm that, given placement decisions made by a default policy on a pool of refugees and their employment outcomes, solves an inverse matching problem to minimally modify the predictions made by a given classifier. Under these modified predictions, the optimal matching policy that maximizes predicted utility on the pool is guaranteed to be not harmful. Further, we introduce a Transformer model that, given placement decisions made by a default policy on multiple pools of refugees and their employment outcomes, learns to modify the predictions made by a classifier so that the optimal matching policy that maximizes predicted utility under the modified predictions on an unseen pool of refugees is less likely to be harmful than under the original predictions. Experiments on simulated resettlement processes using synthetic refugee data created from a variety of publicly available data suggest that our methodology may be effective in making algorithmic placement decisions that are less likely to be harmful than existing solutions.
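The matching step itself is standard; below is a sketch of maximum weight bipartite matching over toy predicted employment probabilities (real resettlement processes also involve location capacities, which are omitted here for brevity).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy predicted employment probabilities: rows are refugees, columns are locations.
# These values are illustrative; in the paper they come from a trained classifier.
predicted_prob = np.array([
    [0.30, 0.55, 0.10],
    [0.60, 0.40, 0.20],
    [0.25, 0.35, 0.50],
])

# Maximum weight bipartite matching: maximize the sum of predicted probabilities.
refugees, locations = linear_sum_assignment(predicted_prob, maximize=True)
for r, l in zip(refugees, locations):
    print(f"refugee {r} -> location {l} (predicted probability {predicted_prob[r, l]:.2f})")
```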
Finding Counterfactually Optimal Action Sequences in Continuous State Spaces
Tsirtsis, Stratis, Gomez-Rodriguez, Manuel
Whenever a clinician reflects on the efficacy of a sequence of treatment decisions for a patient, they may try to identify critical time steps where, had they made different decisions, the patient's health would have improved. While recent methods at the intersection of causal inference and reinforcement learning promise to aid human experts, such as the clinician above, in retrospectively analyzing sequential decision making processes, they have focused on environments with finitely many discrete states. However, in many practical applications, the state of the environment is inherently continuous in nature. In this paper, we aim to fill this gap. We start by formally characterizing a sequence of discrete actions and continuous states using finite horizon Markov decision processes and a broad class of bijective structural causal models. Building upon this characterization, we formalize the problem of finding counterfactually optimal action sequences and show that, in general, we cannot expect to solve it in polynomial time. Then, we develop a search method based on the $A^*$ algorithm that, under a natural form of Lipschitz continuity of the environment's dynamics, is guaranteed to return the optimal solution to the problem. Experiments on real clinical data show that our method is very efficient in practice, and it has the potential to offer interesting insights for sequential decision making tasks.
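For reference, the skeleton below shows a generic A* search on a toy graph; the paper adapts this template to search over alternative action sequences, with a Lipschitz-based bound on the counterfactual outcome playing the role of the heuristic. The code is a generic sketch, not the authors' implementation.

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """Generic A*: `neighbors(n)` yields (next_node, edge_cost) pairs and
    `heuristic` must never overestimate the true remaining cost (admissibility)."""
    frontier = [(heuristic(start), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        for nxt, edge_cost in neighbors(node):
            new_cost = cost + edge_cost
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt, path + [nxt]))
    return float("inf"), []

# Toy usage: shortest path on the chain 0 -> 1 -> ... -> 5, each step of cost 1.
chain_neighbors = lambda n: [(n + 1, 1.0)] if n < 5 else []
print(a_star(0, 5, chain_neighbors, heuristic=lambda n: 5 - n))
```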
Counterfactual Explanations in Sequential Decision Making Under Uncertainty
Tsirtsis, Stratis, De, Abir, Gomez-Rodriguez, Manuel
Methods to find counterfactual explanations have predominantly focused on one step decision making processes. In this work, we initiate the development of methods to find counterfactual explanations for decision making processes in which multiple, dependent actions are taken sequentially over time. We start by formally characterizing a sequence of actions and states using finite horizon Markov decision processes and the Gumbel-Max structural causal model. Building upon this characterization, we formally state the problem of finding counterfactual explanations for sequential decision making processes. In our problem formulation, the counterfactual explanation specifies an alternative sequence of actions differing in at most k actions from the observed sequence that could have led the observed process realization to a better outcome. Then, we introduce a polynomial time algorithm based on dynamic programming to build a counterfactual policy that is guaranteed to always provide the optimal counterfactual explanation on every possible realization of the counterfactual environment dynamics. We validate our algorithm using both synthetic and real data from cognitive behavioral therapy and show that the counterfactual explanations our algorithm finds can provide valuable insights to enhance sequential decision making under uncertainty.
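A minimal sketch of such a dynamic program is shown below, under strong simplifying assumptions: a small finite MDP, a counterfactual transition model assumed to be already derived from the Gumbel-Max SCM and the observed episode, and rewards on states. All names and quantities are illustrative, not the paper's implementation.

```python
import numpy as np

n_states, n_actions, horizon, k = 4, 2, 3, 1
rng = np.random.default_rng(0)

# Counterfactual transition model P_cf[s, a] -> distribution over next states, and a
# state reward; in the paper these would be derived from the observed realization.
P_cf = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
reward = rng.normal(size=n_states)
observed_actions = [0, 0, 1]  # actions taken in the observed episode

# V[t, s, j] = best expected counterfactual return from time t in state s
# when at most j of the remaining actions may differ from the observed ones.
V = np.zeros((horizon + 1, n_states, k + 1))
for t in reversed(range(horizon)):
    for s in range(n_states):
        for j in range(k + 1):
            best = -np.inf
            for a in range(n_actions):
                change = int(a != observed_actions[t])
                if change > j:
                    continue
                value = P_cf[s, a] @ (reward + V[t + 1, :, j - change])
                best = max(best, value)
            V[t, s, j] = best

print("optimal counterfactual value from state 0:", V[0, 0, k])
```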
Differentiable Learning Under Triage
Okati, Nastaran, De, Abir, Gomez-Rodriguez, Manuel
Multiple lines of evidence suggest that predictive models may benefit from algorithmic triage. Under algorithmic triage, a predictive model does not predict all instances but instead defers some of them to human experts. However, the interplay between the prediction accuracy of the model and the human experts under algorithmic triage is not well understood. In this work, we start by formally characterizing under which circumstances a predictive model may benefit from algorithmic triage. In doing so, we also demonstrate that models trained for full automation may be suboptimal under triage. Then, given any model and desired level of triage, we show that the optimal triage policy is a deterministic threshold rule in which triage decisions are derived deterministically by thresholding the difference between the model and human errors on a per-instance level. Building upon these results, we introduce a practical gradient-based algorithm that is guaranteed to find a sequence of triage policies and predictive models of increasing performance. Experiments on a wide variety of supervised learning tasks using synthetic and real data from two important applications -- content moderation and scientific discovery -- illustrate our theoretical results and show that the models and triage policies provided by our gradient-based algorithm outperform those provided by several competitive baselines.
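A minimal sketch of the per-instance rule, assuming the expected per-instance errors of the model and the human are given (in practice they have to be estimated); the numbers and the budget-based formulation are illustrative.

```python
import numpy as np

def triage(model_errors, human_errors, budget):
    """Defer to the human the instances with the largest (model error - human error),
    up to a given triage budget, and never defer when the model is already better."""
    gain = model_errors - human_errors        # expected benefit of deferring each instance
    order = np.argsort(-gain)                 # most beneficial deferrals first
    defer = np.zeros(len(gain), dtype=bool)
    defer[order[:budget]] = gain[order[:budget]] > 0
    return defer

# Toy per-instance expected errors for a predictive model and a human expert.
model_err = np.array([0.9, 0.1, 0.4, 0.7, 0.2])
human_err = np.array([0.3, 0.2, 0.5, 0.1, 0.6])
print(triage(model_err, human_err, budget=2))  # True marks instances deferred to the human
```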
Large-scale randomized experiment reveals machine learning helps people learn and remember more effectively
Upadhyay, Utkarsh, Lancashire, Graham, Moser, Christoph, Gomez-Rodriguez, Manuel
The greater degree of control and personalization offered by learning apps and online platforms promises to facilitate the design and implementation of automated, data-driven teaching policies that adapt to each learner's knowledge over time, improving upon the traditional one-size-fits-all human instruction. However, to fulfill this promise, it is necessary to develop adaptive data-driven models of the learners, which accurately quantify their knowledge, and efficient methods to find teaching policies that are provably optimal under the learners' models [1, 2]. In this context, research in the (theoretical) computer science literature has typically focused on finding teaching policies that enjoy optimality guarantees under simplified mathematical models of the learner's knowledge [3-7]. In contrast, research in the cognitive sciences has focused on measuring the effectiveness of a variety of heuristic teaching policies informed by psychologically valid models of the learner's knowledge using (small) randomized controlled trials [8-11]. Only very recently, Tabibian et al. [12] have introduced a machine learning modeling framework that bridges the gap between both lines of research--their framework can be used to determine the optimal rate of study a learner should follow under a model of the learner's memory state that is informed by real human memory data. However, in the evaluation of their framework, the authors resort to a natural experiment using data from a popular language-learning online platform rather than a randomized controlled trial, the gold standard in the cognitive sciences literature. As a result, it has been argued that, in an interventional setting, an actual learner following the rate of study may fail to achieve optimal performance [1]. In this paper, we build upon the modeling framework of Tabibian et al. [12] and design Select, a simple, efficient and adaptive machine learning algorithm with theoretical guarantees to determine which questions to include in a learner's sessions of study over time, rather than optimizing the rate of study as in Tabibian et al.
Classification Under Human Assistance
De, Abir, Okati, Nastaran, Zarezade, Ali, Gomez-Rodriguez, Manuel
Most supervised learning models are trained for full automation. However, their predictions are sometimes worse than those made by human experts on specific instances. Motivated by this empirical observation, our goal is to design classifiers that are optimized to operate under different automation levels. More specifically, we focus on convex margin-based classifiers and first show that the resulting training problem is NP-hard. Then, we further show that, for support vector machines, the corresponding objective function can be expressed as the difference of two functions $f = g - c$, where $g$ is monotone, non-negative and $\gamma$-weakly submodular, and $c$ is non-negative and modular. This representation allows a recently introduced deterministic greedy algorithm, as well as a more efficient randomized variant of the algorithm, to enjoy approximation guarantees for solving the problem. Experiments on synthetic and real-world data from several applications in medical diagnosis illustrate our theoretical findings and demonstrate that, under human assistance, supervised learning models trained to operate under different automation levels can outperform those trained for full automation as well as humans operating alone.
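For intuition, a plain deterministic greedy skeleton is sketched below; the algorithm the abstract refers to is tailored to $f = g - c$ objectives and comes with approximation guarantees, so this is only a rough illustration rather than the authors' method.

```python
def greedy_select(candidates, objective, k):
    """Plain deterministic greedy: repeatedly add the candidate with the largest
    positive marginal gain to the objective, up to k elements."""
    selected = set()
    for _ in range(k):
        best, best_gain = None, 0.0
        for c in candidates - selected:
            gain = objective(selected | {c}) - objective(selected)
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:
            break
        selected.add(best)
    return selected

# Toy usage: pick the 2 elements with the largest (modular) weights.
weights = {0: 3.0, 1: 1.0, 2: 2.5, 3: 0.5}
print(greedy_select(set(weights), lambda S: sum(weights[i] for i in S), k=2))
```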
Can A User Anticipate What Her Followers Want?
De, Abir, Singla, Adish, Upadhyay, Utkarsh, Gomez-Rodriguez, Manuel
Whenever a social media user decides to share a story, she is typically pleased to receive likes, comments, shares, or, more generally, feedback from her followers. As a result, she may feel compelled to use the feedback she receives to (re-)estimate her followers' preferences and decide which stories to share next to receive more (positive) feedback. Under which conditions can she succeed? In this work, we first look into this problem from a theoretical perspective and then provide a set of practical algorithms to identify and characterize such behavior in social media. More specifically, we address the above problem from the viewpoint of sequential decision making and utility maximization. For a wide variety of utility functions, we first show that, to succeed, a user needs to actively trade off exploitation--sharing stories that lead to more (positive) feedback--and exploration--sharing stories to learn about her followers' preferences. However, exploration is not necessary if a user utilizes the feedback her followers provide to other users in addition to the feedback she receives. Then, we develop a utility estimation framework for observational data, which relies on statistical hypothesis testing to determine whether a user utilizes the feedback she receives from each of her followers to decide what to post next. Experiments on synthetic data illustrate our theoretical findings and show that our estimation framework is able to accurately recover users' underlying utility functions. Experiments on several real datasets gathered from Twitter and Reddit reveal that up to 82% (43%) of the Twitter (Reddit) users in our datasets do use the feedback they receive to decide what to post next.