Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Jing Huang, Junyi Tao, Thomas Icard, Diyi Yang, Christopher Potts
Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks (including symbol manipulation, knowledge retrieval, and instruction following), we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behavior is especially important. Our work thus highlights a novel and significant application for internal causal analysis of language models.
May 20, 2025
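The abstract only sketches value probing at a high level. As an illustration only, here is a minimal sketch of how a correctness probe over internal activations might be trained and evaluated with AUC-ROC; the sklearn-based setup, function names, and data variables are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of a "value probe" for correctness prediction.
# Assumptions (not from the paper): hidden states have already been extracted
# at a layer/position thought to encode a key causal variable, and correctness
# labels come from comparing model outputs to gold answers.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def fit_value_probe(train_hidden: np.ndarray, train_correct: np.ndarray) -> LogisticRegression:
    """Fit a linear probe mapping hidden activations to P(output is correct)."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_hidden, train_correct)
    return probe


def evaluate_probe(probe: LogisticRegression, hidden: np.ndarray, correct: np.ndarray) -> float:
    """AUC-ROC of the probe's correctness predictions on a held-out split."""
    scores = probe.predict_proba(hidden)[:, 1]
    return roc_auc_score(correct, scores)


# Usage sketch: train the probe in distribution, then check whether its
# predictions transfer to an out-of-distribution split.
# probe = fit_value_probe(id_hidden, id_correct)
# print("ID AUC: ", evaluate_probe(probe, id_hidden_heldout, id_correct_heldout))
# print("OOD AUC:", evaluate_probe(probe, ood_hidden, ood_correct))
```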