Predicting the Performance of Black-box LLMs through Self-Queries

Dylan Sam, Marc Finzi, J. Zico Kolter

arXiv.org Artificial Intelligence 

As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. While a great deal of work in the field uses internal representations to interpret model behavior, these representations are inaccessible when given solely black-box access through an API. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations to train reliable predictors of model behavior. We demonstrate that training a linear model on these low-dimensional representations produces reliable and generalizable predictors of model performance at the instance level (e.g., whether a particular generation correctly answers a question). Remarkably, these predictors can often outperform white-box linear predictors that operate over a model's hidden state or the full distribution over its vocabulary. In addition, we demonstrate that these extracted features can be used to evaluate more nuanced aspects of a language model's state. For instance, they can distinguish between a clean version of GPT-4o-mini and a version influenced by an adversarial system prompt that causes it to answer questions incorrectly or to introduce bugs into generated code. Furthermore, they can reliably distinguish between different model architectures and sizes, enabling the detection of misrepresented models provided through an API (e.g., identifying if GPT-3.5 is supplied instead of GPT-4o-mini).

Large language models (LLMs) have demonstrated strong performance on a wide variety of tasks (Radford et al.), leading to their increased involvement in larger systems. For instance, they are often used to provide supervision (Bai et al., 2022; Sam et al., 2024), as tools in decision-making (Benary et al., 2023; Sha et al., 2023), or as controllers in agentic frameworks (Xi et al., 2023; Robey et al., 2024). Thus, it is crucial to understand and predict their behaviors, especially in high-stakes settings. However, as with any deep network, it is difficult to understand the behavior of such large models (Zhang et al., 2021). For instance, prior work has studied input gradients and saliency maps (Simonyan et al., 2013; Zeiler & Fergus, 2014; Pukdee et al., 2024) to attempt to understand neural network behavior, but these methods can fail to reliably describe model behavior (Adebayo et al., 2018; Kindermans et al., 2019; Srinivas & Fleuret, 2020). Other work has studied the ability of transformers to represent certain algorithms (Nanda et al., 2022; Zhong et al., 2024) that may be involved in their predictions.

One promising direction for understanding LLMs (or any other multimodal model that understands natural language) is to leverage their ability to interact with human queries. Recent work has demonstrated that an LLM's hidden state contains low-dimensional features of model truthfulness or harmfulness (Zou et al., 2023a).
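To make the black-box feature-extraction recipe described in the abstract concrete, the following is a minimal sketch, not the authors' implementation: the follow-up prompts, the `get_followup_probs` helper, and the placeholder data are illustrative stand-ins, and in practice the probabilities would come from an API that exposes token log-probabilities.

```python
# Sketch: black-box features from follow-up prompts + a linear probe that
# predicts per-instance correctness. Hypothetical prompts and a stubbed
# query function are used so the example runs end to end.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative follow-up prompts; the prompts used in the paper may differ.
FOLLOW_UPS = [
    "Are you confident in your previous answer? Answer Yes or No.",
    "Would you change your answer if asked again? Answer Yes or No.",
    "Is your answer consistent with the question's premise? Answer Yes or No.",
]

def get_followup_probs(question: str, model_answer: str) -> np.ndarray:
    """Return the model's probability of responding 'Yes' to each follow-up.
    In a real system this would query the black-box LLM and read the elicited
    token probabilities; here it is a deterministic random stub."""
    seed = abs(hash((question, model_answer))) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.uniform(size=len(FOLLOW_UPS))

def extract_features(examples) -> np.ndarray:
    """Stack elicited probabilities into a low-dimensional feature matrix."""
    return np.stack([get_followup_probs(q, a) for q, a, _ in examples])

# Each example: (question, model_answer, correctness label from grading).
train = [(f"q{i}", f"a{i}", i % 2) for i in range(200)]        # placeholder data
test = [(f"q{i}", f"a{i}", i % 2) for i in range(200, 240)]    # placeholder data

X_train, y_train = extract_features(train), np.array([y for *_, y in train])
X_test, y_test = extract_features(test), np.array([y for *_, y in test])

# A simple linear model over the low-dimensional black-box features.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))
```

The key design point the paper emphasizes is that the feature vector is low-dimensional (one entry per follow-up prompt), so a simple linear model suffices and no access to hidden states or the full output vocabulary distribution is required.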