Language Models Can Predict Their Own Behavior

Jun-17-2026, 12:58:01 GMT – Neural Information Processing Systems

The text produced by language models (LMs) can exhibit specific 'behaviors,' such as a failure to follow alignment training, that we hope to detect and react to during deployment. Identifying these behaviors can often only be done post facto, i.e., after the entire text of the output has been generated. We provide evidence that there are times when we can predict how an LM will behave early in computation, before even a single token is generated. We show that probes trained on the internal representation of input tokens alone can predict a wide range of eventual behaviors over the entire output sequence. Using methods from conformal prediction, we provide provable bounds on the estimation error of our probes, creating precise early warning systems for these behaviors.
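The idea of a probe paired with a conformal guarantee can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the random vectors stand in for an LM's hidden states over the input tokens, the binary label stands in for a "behavior" of the eventual output, and the logistic-regression probe, the split-conformal calibration, and every function name below are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe by plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        grad = sigmoid(X @ w + b) - y          # gradient of the log-loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_proba(X, w, b):
    """Return class probabilities with shape (n, 2)."""
    p1 = sigmoid(X @ w + b)
    return np.stack([1.0 - p1, p1], axis=1)

def conformal_threshold(probs_cal, y_cal, alpha):
    """Split-conformal quantile of the scores 1 - p(true label)."""
    n = len(y_cal)
    scores = 1.0 - probs_cal[np.arange(n), y_cal]
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    if k > n:                                  # too few calibration points: keep every label
        return 1.0
    return np.sort(scores)[k - 1]

def prediction_sets(probs, qhat):
    """Boolean mask (n, 2): True where a label is kept in the prediction set."""
    return (1.0 - probs) <= qhat

# Synthetic stand-in for hidden states (assumption: 16-dim vectors, noisy linear label).
rng = np.random.default_rng(0)
d, n = 16, 3000
direction = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (rng.random(n) < sigmoid(X @ direction)).astype(int)

# Disjoint splits: train the probe, calibrate the threshold, then evaluate.
X_tr, y_tr = X[:1000], y[:1000]
X_cal, y_cal = X[1000:2000], y[1000:2000]
X_te, y_te = X[2000:], y[2000:]

alpha = 0.1                                    # target miscoverage rate
w, b = train_probe(X_tr, y_tr)
qhat = conformal_threshold(probe_proba(X_cal, w, b), y_cal, alpha)
sets = prediction_sets(probe_proba(X_te, w, b), qhat)

# Fraction of test points whose true label is inside the prediction set.
coverage = sets[np.arange(len(y_te)), y_te].mean()
print(f"empirical coverage: {coverage:.3f} (target >= {1 - alpha:.2f})")
```

Under the usual exchangeability assumption, the split-conformal threshold gives marginal coverage of at least 1 - alpha on average over calibration draws, so the printed value should sit near the target rather than exactly on it. A prediction set containing both labels signals that the probe is too uncertain to commit before generation.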

large language model, machine learning, natural language

Country:
- Asia (0.93)
- Europe (0.67)
- North America > United States
  - Minnesota (0.28)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Commercial Services & Supplies > Security & Alarm Services (0.34)
- Education
  - Educational Setting (0.46)
  - Curriculum > Subject-Specific Education (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.94)
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)