Predicting the Performance of Black box Language Models with Follow up Queries

Jun-23-2026, 00:57:58 GMT–Neural Information Processing Systems

Reliably predicting the behavior of language models--such as whether their outputs are correct or have been adversarially manipulated--is a fundamentally challenging task. This is often made even more difficult as frontier language models are offered only through closed-source APIs, providing only black-box access. In this paper, we predict the behavior of black-box language models by asking follow-up questions and taking the probabilities of responses as representations to train reliable predictors. We first demonstrate that training a linear model on these responses reliably and accurately predicts model correctness on question-answering and reasoning benchmarks. Surprisingly, this can even outperform white-box linear predictors that operate over model internals or activations. Furthermore, we demonstrate that these follow-up question responses can reliably distinguish between a clean version of an LLM and one that has been adversarially influenced via a system prompt to answer questions incorrectly or to introduce bugs into generated code. Finally, we show that they can also be used to differentiate between blackbox LLMs, enabling the detection of misrepresented models provided through an API. Overall, our work shows promise in monitoring black-box language model behavior, supporting their deployment in larger, autonomous systems.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Jun-23-2026, 00:57:58 GMT

Conferences PDF

Add feedback

Country:
- North America > United States (0.46)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Transportation > Air (1.00)
- Information Technology (0.92)
- Government (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found