Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

Jan-20-2025, 02:56:31 GMT–Neural Information Processing Systems

Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and generalize robustly to unseen inputs. Distributed Alignment Search (DAS) is a powerful gradient descent method, grounded in a theory of causal abstraction, that has uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. In the present paper, we scale DAS significantly by replacing the remaining brute-force search steps with learned parameters, an approach we call Boundless DAS. This enables us to efficiently search for interpretable causal structure in large language models while they follow instructions.
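The idea of replacing a brute-force search with learned parameters can be illustrated with a minimal sketch. It assumes, beyond what the abstract states, that an intervention swaps part of a rotated hidden representation between a base input and a source input, and that the size of the swapped part is set by a continuous boundary passed through a sigmoid, so it could be tuned by gradient descent rather than enumerated. All function names and the fixed `rotation`, `boundary`, and `temperature` values are hypothetical, and training is omitted entirely.

```python
import math

def soft_boundary_mask(dim, boundary, temperature):
    """Sigmoid mask: close to 1 for indices below `boundary`, close to 0 above it."""
    return [1.0 / (1.0 + math.exp(-(boundary - (i + 0.5)) / temperature))
            for i in range(dim)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]

def interchange(base, source, rotation, boundary, temperature=0.1):
    """Blend rotated coordinates of `source` into `base` according to a soft mask.

    `rotation` is assumed to be orthogonal, so its transpose is its inverse.
    In a learned setting, `rotation` and `boundary` would be trainable
    parameters; here they are fixed inputs (an assumption of this sketch).
    """
    rotated_base = matvec(rotation, base)
    rotated_source = matvec(rotation, source)
    mask = soft_boundary_mask(len(base), boundary, temperature)
    mixed = [m * s + (1.0 - m) * b
             for m, s, b in zip(mask, rotated_source, rotated_base)]
    return matvec(transpose(rotation), mixed)

if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(interchange([1.0, 2.0], [10.0, 20.0], identity, boundary=1.0, temperature=0.01))
```

With an identity rotation, a boundary of 1, and a low temperature, the result takes its first coordinate almost entirely from the source vector and its second from the base vector. Raising the temperature softens the mask, which is what would make the boundary differentiable during training.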

identifying causal mechanism, interpretability, language model

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.91)
  - Machine Learning > Neural Networks
    - Deep Learning (0.62)