Interpretability Illusions in the Generalization of Simplified Models
Dan Friedman, Andrew Lampinen, Lucas Dixon, Danqi Chen, Asma Ghandeharioun
A common method for studying deep learning systems is to use simplified model representations, for example, using singular value decomposition to visualize the model's hidden states in a lower-dimensional space. This approach assumes that the results of these simplifications are faithful to the original model. Here, we illustrate an important caveat to this assumption: even if the simplified representations accurately approximate the full model on the training set, they may fail to capture the model's behavior out of distribution; the understanding developed from simplified representations may be an illusion. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits. First, we train models on Dyck balanced-parenthesis languages. We simplify these models using tools such as dimensionality reduction and clustering, and then explicitly test how well these simplified proxies match the behavior of the original model on various out-of-distribution test sets. We find that the simplified proxies are generally less faithful out of distribution: in cases where the original model generalizes to novel structures or greater depths, the simplified versions may fail to do so, or may even generalize better. This finding holds even when the simplified representations do not directly depend on the training distribution. Next, we study a more naturalistic task: predicting the next character in a dataset of computer code. We find similar generalization gaps between the original model and simplified proxies, and conduct further analysis to investigate which aspects of the code-completion task are associated with the largest gaps. Together, our results raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations.

How can we understand deep learning models? Often, we begin by simplifying the model, or its representations, using tools like dimensionality reduction, clustering, and discretization. We then interpret the results of these simplifications, for example by finding dimensions in the principal components that encode a task-relevant feature. In other words, we are essentially replacing the original model with a simplified proxy that uses a more limited, and thus easier to interpret, set of features. By analyzing these simplified proxies, we hope to understand at an abstract level how the system solves a task. Ideally, this understanding would help us predict how the model will behave in unfamiliar situations, and thereby anticipate failure cases or potentially unsafe behavior.
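To make the controlled setup concrete, the sketch below generates balanced-parenthesis strings and builds a systematic generalization split in which deeper nesting is held out for evaluation. This is a minimal sketch, assuming a single bracket type (Dyck-1) and a simple depth cutoff; the paper's actual datasets (number of bracket types, sequence lengths, and split criteria) may differ.

```python
import random

def gen_dyck(n_pairs, max_depth, rng):
    """Sample a balanced Dyck-1 string with n_pairs bracket pairs and
    nesting depth at most max_depth (requires max_depth >= 1)."""
    s, depth = [], 0
    opens_left, closes_left = n_pairs, n_pairs
    while closes_left > 0:
        can_open = opens_left > 0 and depth < max_depth
        can_close = depth > 0
        if can_open and (not can_close or rng.random() < 0.5):
            s.append("(")
            depth += 1
            opens_left -= 1
        else:
            s.append(")")
            depth -= 1
            closes_left -= 1
    return "".join(s)

rng = random.Random(0)
# Training data is restricted to shallow nesting; the OOD split allows
# strictly deeper structures than anything seen during training.
train_set = [gen_dyck(n_pairs=16, max_depth=4, rng=rng) for _ in range(10_000)]
ood_set = [gen_dyck(n_pairs=16, max_depth=10, rng=rng) for _ in range(1_000)]
```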
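The faithfulness test itself can be sketched in a few lines: fit a rank-k SVD projection on hidden states drawn from the training distribution, apply the same output head to the projected ("simplified proxy") states and the raw states, and measure how often their predictions agree in distribution versus out of distribution. The random arrays below merely stand in for real hidden states and an output head, and the function names and the rank k are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def fit_projection(H_train, k):
    """Fit a rank-k projector from the top-k right singular vectors of
    training hidden states H_train (shape: n_tokens x d)."""
    _, _, Vt = np.linalg.svd(H_train, full_matrices=False)
    return Vt[:k].T @ Vt[:k]  # (d, d) projection onto the top-k subspace

def agreement(H, W_out, P):
    """Fraction of positions where the simplified proxy (projected states)
    and the full model (raw states) make the same argmax prediction."""
    full_preds = (H @ W_out).argmax(axis=-1)
    proxy_preds = (H @ P @ W_out).argmax(axis=-1)
    return (full_preds == proxy_preds).mean()

# Toy numbers standing in for real hidden states and an output head.
rng = np.random.default_rng(0)
d, vocab = 64, 20
H_train = rng.normal(size=(2048, d))                             # in-distribution states
H_ood = rng.normal(size=(2048, d)) @ rng.normal(size=(d, d))     # shifted stand-in for OOD states
W_out = rng.normal(size=(d, vocab))

P = fit_projection(H_train, k=8)
print("in-distribution agreement:", agreement(H_train, W_out, P))
print("out-of-distribution agreement:", agreement(H_ood, W_out, P))
```

With real hidden states, the expectation from the paper is that proxy–model agreement is high in distribution but degrades on out-of-distribution inputs, which is the generalization gap the abstract describes.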
arXiv.org Artificial Intelligence
Dec-6-2023