Identifying Linear Relational Concepts in Large Language Models
Chanin, David, Hunter, Anthony, Camburu, Oana-Maria
arXiv.org Artificial Intelligence
Transformer language models (LMs) have been shown to represent concepts as directions in the latent space of hidden activations. However, for any given human-interpretable concept, how can we find its direction in the latent space? We present a technique called linear relational concepts (LRC) for finding concept directions corresponding to human-interpretable concepts at a given hidden layer in a transformer LM by first modeling the relation between subject and object as a linear relational embedding (LRE). While the LRE work was mainly presented as an exercise in understanding model representations, we find that inverting the LRE while using earlier object layers results in a powerful technique to find concept directions that both work well as a classifier and causally influence model outputs.
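The idea of inverting an LRE to obtain a concept direction can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes the LRE is an affine map from subject activations to object activations, o ≈ W s + b, and that "inverting" means applying a low-rank pseudo-inverse of W to an object activation. The names `low_rank_pinv`, `concept_direction`, and `concept_score`, the rank value, the choice to subtract the bias, and the random toy data are all hypothetical. The choice of subject and object layers mentioned in the abstract is not modeled here.

```python
import numpy as np


def low_rank_pinv(W, rank):
    """Pseudo-inverse of W that keeps only the top `rank` singular components."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_r, S_r, Vt_r = U[:, :rank], S[:rank], Vt[:rank, :]
    return Vt_r.T @ np.diag(1.0 / S_r) @ U_r.T


def concept_direction(W, b, object_activation, rank):
    """Map an object activation back to subject space and normalize it.

    Assumption: the LRE has the affine form o = W s + b, so the bias is
    subtracted before inverting. The paper may handle the bias differently.
    """
    v = low_rank_pinv(W, rank) @ (object_activation - b)
    return v / np.linalg.norm(v)


def concept_score(direction, subject_activation):
    """Classifier score: projection of a subject activation onto the direction."""
    return float(direction @ subject_activation)


# Toy data standing in for real model activations (purely synthetic).
rng = np.random.default_rng(0)
d_subj, d_obj, rank = 16, 12, 4

# A rank-4 matrix W and a bias b play the role of a fitted LRE.
W = rng.normal(size=(d_obj, rank)) @ rng.normal(size=(rank, d_subj))
b = rng.normal(size=d_obj)

# A subject activation and the object activation the toy LRE predicts for it.
s_true = rng.normal(size=d_subj)
o = W @ s_true + b

direction = concept_direction(W, b, o, rank)
score = concept_score(direction, s_true)
```

In this toy setup the resulting direction has unit norm, and pushing it forward through W recovers the direction of `o - b`, which is the sense in which it "inverts" the relation. With real activations, W and b would be estimated from the model rather than sampled at random.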
Nov-15-2023
- Country:
  - South America > Brazil
    - Rio de Janeiro > Rio de Janeiro (0.04)
  - North America
    - Costa Rica (0.04)
    - Dominican Republic (0.04)
    - United States
      - New York (0.04)
      - Nevada > Clark County
        - Las Vegas (0.04)
    - Panama > Panama
      - Panama City (0.04)
  - Europe
    - France (0.06)
    - Germany (0.04)
    - United Kingdom > England
      - Greater London > London (0.04)
    - Russia > Northwestern Federal District
      - Leningrad Oblast > Saint Petersburg (0.04)
  - Asia
    - Russia (0.04)
    - Malaysia > Kuala Lumpur
      - Kuala Lumpur (0.04)
    - Japan > Honshū
      - Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
    - China > Shanghai
      - Shanghai (0.04)
  - Africa > South Africa
    - Gauteng > Johannesburg (0.04)
- Genre:
  - Research Report (1.00)
- Technology: