Caruana, Rich
GAMformer: In-Context Learning for Generalized Additive Models
Mueller, Andreas, Siems, Julien, Nori, Harsha, Salinas, David, Zela, Arber, Caruana, Rich, Hutter, Frank
Generalized Additive Models (GAMs) are widely recognized for their ability to create fully interpretable machine learning models for tabular data. Traditionally, training GAMs involves iterative learning algorithms, such as splines, boosted trees, or neural networks, which refine the additive components through repeated error reduction. In this paper, we introduce GAMformer, the first method to leverage in-context learning to estimate shape functions of a GAM in a single forward pass, representing a significant departure from the conventional iterative approaches to GAM fitting. Building on previous research applying in-context learning to tabular data, we exclusively use complex, synthetic data to train GAMformer, yet find it extrapolates well to real-world data. Our experiments show that GAMformer performs on par with other leading GAMs across various classification benchmarks while generating highly interpretable shape functions.

The growing importance of interpretability in machine learning is evident, especially in areas where transparency, fairness, and accountability are critical (Barocas and Selbst, 2016; Rudin et al., 2022). Interpretable models are essential for building trust between humans and AI systems by allowing users to understand the reasoning behind the model's predictions and decisions (Ribeiro et al., 2016). This is crucial in safety-critical fields like healthcare, where incorrect or biased decisions can have severe consequences (Caruana et al., 2015). Additionally, interpretability is vital for regulatory compliance in sectors like finance and hiring, where explaining and justifying model outcomes is necessary (Arun et al., 2016; Dattner et al., 2019). Interpretable models also help detect and mitigate bias by revealing the factors influencing predictions, ensuring fair and unbiased decisions across different population groups (Mehrabi et al., 2021). Generalized Additive Models (GAMs) have proven a popular choice for interpretable modeling due to their high accuracy and interpretability. In GAMs, the target variable is expressed as a sum of non-linearly transformed features.
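As a point of reference for the additive structure described above, the usual textbook form of a GAM (standard notation, not taken verbatim from the paper) writes the expected response through a link function as a sum of per-feature shape functions:

```latex
% Standard GAM formulation (textbook notation, not copied from the paper):
% a link function g relates the expected response to a sum of per-feature
% shape functions f_j, each of which can be plotted and inspected on its own.
g\bigl(\mathbb{E}[\,y \mid x\,]\bigr) \;=\; \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p)
```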
Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models
Bordt, Sebastian, Nori, Harsha, Rodrigues, Vanessa, Nushi, Besmira, Caruana, Rich
While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of different techniques to assess whether a language model has seen a tabular dataset during training. This investigation reveals that LLMs have memorized many popular tabular datasets verbatim. We then compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. At the same time, LLMs show non-trivial performance on novel datasets and are surprisingly robust to data transformations. We then investigate the in-context statistical learning abilities of LLMs. Without fine-tuning, we find them to be limited. This suggests that much of the few-shot performance on novel datasets is due to the LLM's world knowledge. Overall, our results highlight the importance of testing whether an LLM has seen an evaluation dataset during pre-training. We make the exposure tests we developed available as the tabmemcheck Python package at https://github.com/interpretml/LLM-Tabular-Memorization-Checker
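One of the exposure tests described above can be pictured as a simple verbatim-completion check: show the model the opening rows of a CSV file and ask whether it reproduces the next row exactly. The sketch below illustrates that idea only; it is not the tabmemcheck implementation, and `query_llm` is a hypothetical stand-in for whatever text-completion API is available.

```python
# Sketch of a verbatim-completion exposure test (illustrative only; this is
# not the tabmemcheck implementation). `query_llm` is a hypothetical stand-in
# for any text-completion API that returns the model's continuation as a string.

def header_completion_test(csv_path: str, query_llm, n_prompt_rows: int = 10) -> bool:
    """Return True if the model reproduces the next CSV row verbatim."""
    with open(csv_path) as f:
        lines = [line.rstrip("\n") for line in f]

    prompt_rows = lines[:n_prompt_rows]       # rows shown to the model
    held_out_row = lines[n_prompt_rows]       # row the model must reproduce

    prompt = (
        "Continue this CSV file with the next row, exactly as it appears:\n"
        + "\n".join(prompt_rows) + "\n"
    )
    completion = query_llm(prompt).strip()
    first_line = completion.splitlines()[0].strip() if completion else ""

    # Verbatim reproduction of an arbitrary data row is strong evidence that
    # the dataset was part of the model's training corpus.
    return first_line == held_out_row.strip()
```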
Elephants Never Forget: Testing Language Models for Memorization of Tabular Data
Bordt, Sebastian, Nori, Harsha, Caruana, Rich
While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Starting with simple qualitative tests for whether an LLM knows the names and values of features, we introduce a variety of different techniques to assess the degrees of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization. Our investigation reveals that LLMs are pre-trained on many popular tabular datasets. This exposure can lead to invalid performance evaluation on downstream tasks because the LLMs have, in effect, been fit to the test set. Interestingly, we also identify a regime where the language model reproduces important statistics of the data, but fails to reproduce the dataset verbatim. On these datasets, although seen during training, good performance on downstream tasks might not be due to overfitting. Our findings underscore the need for ensuring data integrity in machine learning tasks with LLMs. To facilitate future research, we release an open-source tool that can perform various tests for memorization at https://github.com/interpretml/
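The qualitative end of this test suite can be as simple as asking the model for the column names of a dataset identified only by name and comparing the answer to the true header. The sketch below illustrates that feature-names check; it is not the released tool, and `query_llm` is again a hypothetical completion function.

```python
# Sketch of a feature-names test (illustrative only, not the released tool).
# `query_llm` is a hypothetical stand-in for a chat/completion API.

import csv

def feature_names_test(dataset_name: str, csv_path: str, query_llm) -> float:
    """Fraction of true column names that the model lists for the named dataset."""
    with open(csv_path, newline="") as f:
        true_columns = next(csv.reader(f))    # header row of the real file

    prompt = (
        f"List the column names of the '{dataset_name}' dataset, "
        "one per line, with no additional text."
    )
    answer = query_llm(prompt)
    predicted = {line.strip().lower() for line in answer.splitlines() if line.strip()}

    hits = sum(1 for col in true_columns if col.strip().lower() in predicted)
    return hits / len(true_columns)           # 1.0 = model knows every column name
```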
Data Science with LLMs and Interpretable Models
Bordt, Sebastian, Lengerich, Ben, Nori, Harsha, Caruana, Rich
Recent years have seen important advances in the building of interpretable models, machine learning models that are designed to be easily understood by humans. In this work, we show that large language models (LLMs) are remarkably good at working with interpretable models, too. In particular, we show that LLMs can describe, interpret, and debug Generalized Additive Models (GAMs). Combining the flexibility of LLMs with the breadth of statistical patterns accurately described by GAMs enables dataset summarization, question answering, and model critique. LLMs can also improve the interaction between domain experts and interpretable models, and generate hypotheses about the underlying phenomenon. We release \url{https://github.com/interpretml/TalkToEBM} as an open-source LLM-GAM interface.
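The core interface idea, turning each one-dimensional shape function into a short piece of text that an LLM can read, can be sketched in a few lines. The snippet below is a simplified illustration of that idea; it is not the TalkToEBM API, and `query_llm` is a hypothetical completion function.

```python
# Simplified sketch of an LLM-GAM interface (not the TalkToEBM API).
# A one-dimensional shape function is serialized as (bin, contribution) pairs
# and handed to a hypothetical `query_llm` completion function for description.

def describe_shape_function(feature_name, bin_edges, contributions, query_llm):
    """Ask an LLM to describe one GAM component in plain language."""
    rows = [
        f"{lo:.3g} to {hi:.3g}: {c:+.3f}"
        for lo, hi, c in zip(bin_edges[:-1], bin_edges[1:], contributions)
    ]
    prompt = (
        f"The table below is the shape function of a Generalized Additive Model "
        f"for the feature '{feature_name}'. Each row gives a feature range and its "
        f"additive contribution to the model's prediction (log-odds).\n"
        + "\n".join(rows)
        + "\nDescribe the overall pattern and any surprising regions in 2-3 sentences."
    )
    return query_llm(prompt)
```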
Rethinking Interpretability in the Era of Large Language Models
Singh, Chandan, Inala, Jeevana Priya, Galley, Michel, Caruana, Rich, Gao, Jianfeng
Interpretable machine learning has exploded as an area of interest over the last decade, sparked by the rise of increasingly large datasets and deep neural networks. Simultaneously, large language models (LLMs) have demonstrated remarkable capabilities across a wide array of tasks, offering a chance to rethink opportunities in interpretable machine learning. Notably, the capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be given to a human. However, these new capabilities raise new challenges, such as hallucinated explanations and immense computational costs. In this position paper, we start by reviewing existing methods to evaluate the emerging field of LLM interpretation (both interpreting LLMs and using LLMs for explanation). We contend that, despite their limitations, LLMs hold the opportunity to redefine interpretability with a more ambitious scope across many applications, including in auditing LLMs themselves. We highlight two emerging research priorities for LLM interpretation: using LLMs to directly analyze new datasets and to generate interactive explanations.
Extending Explainable Boosting Machines to Scientific Image Data
Schug, Daniel, Yerramreddy, Sai, Caruana, Rich, Greenberg, Craig, Zwolak, Justyna P.
As the deployment of computer vision technology becomes increasingly common in science, the need for explanations of the system and its output has become a focus of great concern. Driven by the pressing need for interpretable models in science, we propose the use of Explainable Boosting Machines (EBMs) for scientific image data. Inspired by an important application underpinning the development of quantum technologies, we apply EBMs to cold-atom soliton image data tabularized using Gabor Wavelet Transform-based techniques that preserve the spatial structure of the data. In doing so, we demonstrate the use of EBMs for image data for the first time and show that our approach provides explanations that are consistent with human intuition about the data.
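At a high level, the pipeline tabularizes each image with Gabor filter responses and then fits a glass-box EBM on the resulting feature table. The sketch below is a simplified stand-in for that pipeline (the paper's Gabor Wavelet Transform-based tabularization preserves more spatial structure than this); it assumes scikit-image and the interpret package are installed.

```python
# Simplified sketch of the image-to-EBM pipeline described above: each image is
# tabularized with a small bank of Gabor filter responses, then a glass-box EBM
# is fit on the resulting feature table. This is illustrative, not the paper's
# exact tabularization scheme.

import numpy as np
from skimage.filters import gabor                              # pip install scikit-image
from interpret.glassbox import ExplainableBoostingClassifier   # pip install interpret

def gabor_features(image: np.ndarray, frequencies=(0.1, 0.2, 0.4)) -> np.ndarray:
    """Summarize one grayscale image as mean/std of Gabor filter responses."""
    feats = []
    for freq in frequencies:
        for theta in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
            real, _ = gabor(image, frequency=freq, theta=theta)
            feats.extend([real.mean(), real.std()])
    return np.array(feats)

def fit_ebm_on_images(images, labels):
    """Tabularize images with Gabor features and fit an Explainable Boosting Machine."""
    X = np.stack([gabor_features(img) for img in images])
    ebm = ExplainableBoostingClassifier()
    ebm.fit(X, labels)
    return ebm   # ebm.explain_global() exposes the learned shape functions
```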
Explaining high-dimensional text classifiers
Melamed, Odelia, Caruana, Rich
Explainability has become a valuable tool in the last few years, helping humans better understand AI-guided decisions. However, the classic explainability tools are sometimes quite limited when considering high-dimensional inputs and neural network classifiers. We present a new explainability method using theoretically proven high-dimensional properties in neural network classifiers.
Interpretable Predictive Models to Understand Risk Factors for Maternal and Fetal Outcomes
Bosschieter, Tomas M., Xu, Zifei, Lan, Hui, Lengerich, Benjamin J., Nori, Harsha, Painter, Ian, Souter, Vivienne, Caruana, Rich
Although most pregnancies result in a good outcome, complications are not uncommon and can be associated with serious implications for mothers and babies. Predictive modeling has the potential to improve outcomes through better understanding of risk factors, heightened surveillance for high-risk patients, and more timely and appropriate interventions, thereby helping obstetricians deliver better care. We identify and study the most important risk factors for four types of pregnancy complications: (i) severe maternal morbidity, (ii) shoulder dystocia, (iii) preterm preeclampsia, and (iv) antepartum stillbirth. We use an Explainable Boosting Machine (EBM), a high-accuracy glass-box learning method, for prediction and identification of important risk factors. We undertake external validation and perform an extensive robustness analysis of the EBM models. EBMs match the accuracy of other black-box ML methods such as deep neural networks and random forests, and outperform logistic regression, while being more interpretable. EBMs prove to be robust. The interpretability of the EBM models reveals surprising insights into the features contributing to risk (e.g. maternal height is the second most important feature for shoulder dystocia) and may have potential for clinical application in the prediction and prevention of serious complications in pregnancy.
LLMs Understand Glass-Box Models, Discover Surprises, and Suggest Repairs
Lengerich, Benjamin J., Bordt, Sebastian, Nori, Harsha, Nunnally, Mark E., Aphinyanaphongs, Yin, Kellis, Manolis, Caruana, Rich
Large language models (LLMs) offer the potential to automate data science through natural language interfaces, but it is difficult to embed complex models or datasets in confined context windows. While GPT-4 has a context window size of up to 32k tokens, paying equal attention to all parts of the context remains a challenge [1] and the practicality of lengthy context windows is questionable. Machine learning models often involve billions of parameters, accentuating the need for compact, modular function representations that more easily interface with LLMs. In this paper, we show that LLMs pair remarkably well with interpretable models that are decomposable into modular components. Specifically, we show that GPT-4 is able to describe, interpret, and debug univariate graphs, and by applying a form of chain-of-thought reasoning [2], GPT-4 can understand Generalized Additive Models (GAMs). GAMs [3, 4] represent complex outcomes as sums of univariate component functions (graphs); thus, by analyzing each of these component functions in turn, the LLM does not need to understand the entire model at once. After analyzing and summarizing each graph, the LLM can operate on component summaries to produce model-level analyses. This modularity simplifies the application of LLMs to data science and machine learning and enables LLM-based analyses to scale to very large datasets while staying within small context windows.
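The modularity argument above maps directly onto a two-stage prompting pattern: summarize each univariate graph on its own, then ask the model to reason over the collected summaries. The sketch below illustrates that pattern with a hypothetical `query_llm` completion function; it is not the code used in the paper.

```python
# Two-stage prompting pattern sketched from the description above (not the
# paper's code). Each univariate component is summarized independently, then
# the per-graph summaries are combined into a model-level analysis, so no
# single prompt needs the whole model in context.

def summarize_gam(components, query_llm):
    """`components` maps feature name -> text serialization of its graph."""
    graph_summaries = {}
    for feature, graph_text in components.items():
        prompt = (
            f"This is the component graph of a Generalized Additive Model for "
            f"the feature '{feature}':\n{graph_text}\n"
            "Summarize the effect of this feature in one or two sentences."
        )
        graph_summaries[feature] = query_llm(prompt)

    combined = "\n".join(f"- {f}: {s}" for f, s in graph_summaries.items())
    model_prompt = (
        "Below are short summaries of every component of a GAM.\n"
        f"{combined}\n"
        "Describe the overall model, flag any surprising effects, and suggest "
        "checks a domain expert should perform."
    )
    return query_llm(model_prompt)
```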
Diagnosis Uncertain Models For Medical Risk Prediction
Peysakhovich, Alexander, Caruana, Rich, Aphinyanaphongs, Yin
In-hospital patient outcome prediction is a major research area at the intersection of machine learning and medicine [Barfod et al., 2012, Taylor et al., 2016, Brajer et al., 2020, Naemi et al., 2021, Soffer et al., 2021, Wiesenfeld et al., 2022]. An important application of such models is 'early' risk prediction - for example, using risk scores for triage [Raita et al., 2019, Klug et al., 2020]. Early prediction often requires calculating patient risk when primary diagnosis is still unknown or uncertain. We propose a method for incorporating uncertainty about diagnosis into mortality risk assessments in an interpretable and actionable way. We study the problem of all-cause in-hospital mortality prediction in the MIMIC-IV dataset [Johnson et al., 2023]. We find that a single model which pools all data and ignores diagnoses (we refer to this as the all-cause model or ACM) performs better at prediction than diagnosis-specific modeling. This increase in performance comes from the fact that the ACM has access to more data (so has lower variance) and that there is substantial transferability in risk across diagnoses (so the ACM bias is not that high). We see this even more starkly by showing that a model trained only on out-of-diagnosis data can, due to this logic, predict risk within a diagnosis just as well as a model trained on that diagnosis only. While ACMs are on average quite performant, we find that there are cases where they can fail.
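The core comparison described here, a single pooled all-cause model versus per-diagnosis models, can be made concrete with a small evaluation harness. The sketch below uses scikit-learn logistic regression purely as a placeholder model; it is not the paper's modeling pipeline, just an illustration of the pooled-vs-stratified comparison on held-out patients.

```python
# Illustrative pooled-vs-per-diagnosis comparison (not the paper's pipeline).
# A single all-cause model (ACM) is trained on all admissions, and a separate
# model is trained per diagnosis group; both are scored on the same held-out
# patients within each diagnosis.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def compare_acm_vs_per_diagnosis(X_train, y_train, dx_train, X_test, y_test, dx_test):
    """Return per-diagnosis test AUCs for the pooled ACM and diagnosis-specific models."""
    acm = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    results = {}
    for dx in np.unique(dx_test):
        tr, te = dx_train == dx, dx_test == dx
        dx_model = LogisticRegression(max_iter=1000).fit(X_train[tr], y_train[tr])
        results[dx] = {
            "acm_auc": roc_auc_score(y_test[te], acm.predict_proba(X_test[te])[:, 1]),
            "dx_auc": roc_auc_score(y_test[te], dx_model.predict_proba(X_test[te])[:, 1]),
        }
    return results
```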