Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Jing Huang, Junyi Tao, Thomas Icard, Diyi Yang, Christopher Potts
Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks (including symbol manipulation, knowledge retrieval, and instruction following), we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behavior is especially important. Our work thus highlights a novel and significant application for internal causal analysis of language models.
May 20, 2025
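The abstract only sketches value probing at a high level. As an illustration only, here is a minimal sketch of how a correctness probe over internal activations might be trained and evaluated with AUC-ROC; the sklearn-based setup, function names, and data variables are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of a "value probe" for correctness prediction.
# Assumptions (not from the paper): hidden states have already been extracted
# at a layer/position thought to encode a key causal variable, and correctness
# labels come from comparing model outputs to gold answers.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def fit_value_probe(train_hidden: np.ndarray, train_correct: np.ndarray) -> LogisticRegression:
    """Fit a linear probe mapping hidden activations to P(output is correct)."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_hidden, train_correct)
    return probe


def evaluate_probe(probe: LogisticRegression, hidden: np.ndarray, correct: np.ndarray) -> float:
    """AUC-ROC of the probe's correctness predictions on a held-out split."""
    scores = probe.predict_proba(hidden)[:, 1]
    return roc_auc_score(correct, scores)


# Usage sketch: train the probe in distribution, then check whether its
# predictions transfer to an out-of-distribution split.
# probe = fit_value_probe(id_hidden, id_correct)
# print("ID AUC: ", evaluate_probe(probe, id_hidden_heldout, id_correct_heldout))
# print("OOD AUC:", evaluate_probe(probe, ood_hidden, ood_correct))
```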