Identifying Linear Relational Concepts in Large Language Models

Chanin, David, Hunter, Anthony, Camburu, Oana-Maria

Nov-15-2023, arXiv.org Artificial Intelligence

Transformer language models (LMs) have been shown to represent concepts as directions in the latent space of hidden activations. However, for any given human-interpretable concept, how can we find its direction in the latent space? We present a technique called linear relational concepts (LRC) for finding concept directions corresponding to human-interpretable concepts at a given hidden layer in a transformer LM by first modeling the relation between subject and object as a linear relational embedding (LRE). While the LRE work was mainly presented as an exercise in understanding model representations, we find that inverting the LRE while using earlier object layers results in a powerful technique to find concept directions that both work well as a classifier and causally influence model outputs.
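The idea of inverting a linear relation to obtain a concept direction can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the abstract does not say how the LRE is estimated or inverted, so this sketch fits the linear map `o ≈ W s + b` by ordinary least squares on synthetic activations and inverts it with a (optionally truncated) SVD pseudo-inverse. The function names `fit_linear_relation`, `concept_direction`, and `concept_score` are hypothetical.

```python
import numpy as np


def fit_linear_relation(S, O):
    """Fit o ≈ W s + b by least squares (an assumed stand-in for an LRE).

    S: (n, d_s) subject activations; O: (n, d_o) object activations.
    """
    X = np.hstack([S, np.ones((S.shape[0], 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(X, O, rcond=None)
    W = coef[:-1].T  # (d_o, d_s)
    b = coef[-1]     # (d_o,)
    return W, b


def concept_direction(W, b, o_target, rank=None):
    """Map a target object activation back to subject space.

    Uses an SVD pseudo-inverse of W, keeping only the top `rank`
    singular values when `rank` is given (assumption: a low-rank
    inverse is one plausible way to invert the linear map).
    """
    U, sv, Vt = np.linalg.svd(W, full_matrices=False)
    k = len(sv) if rank is None else rank
    W_pinv = Vt[:k].T @ np.diag(1.0 / sv[:k]) @ U[:, :k].T  # (d_s, d_o)
    v = W_pinv @ (o_target - b)
    return v / np.linalg.norm(v)  # unit-length direction


def concept_score(v, s):
    """Classifier score: projection of a subject activation onto v."""
    return float(v @ s)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W_true = rng.normal(size=(6, 6))
    b_true = rng.normal(size=6)
    S = rng.normal(size=(50, 6))   # synthetic subject activations
    O = S @ W_true.T + b_true      # synthetic object activations

    W, b = fit_linear_relation(S, O)
    v = concept_direction(W, b, O[0])
    # Higher scores mean the subject activation aligns with the concept.
    print(concept_score(v, S[0]), concept_score(v, S[1]))
```

In this toy setting the recovered direction points along the subject activation that produced the target object, so projecting onto it acts as a simple classifier score. With real model activations the subject and object vectors would come from chosen hidden layers, which this sketch does not model.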

activation, causality, relation


Country:
- South America > Brazil
  - Rio de Janeiro > Rio de Janeiro (0.04)
- North America
  - Costa Rica (0.04)
  - Dominican Republic (0.04)
  - United States
    - New York (0.04)
    - Nevada > Clark County
      - Las Vegas (0.04)
  - Panama > Panama
    - Panama City (0.04)
- Europe
  - France (0.06)
  - Germany (0.04)
  - United Kingdom > England
    - Greater London > London (0.04)
  - Russia > Northwestern Federal District
    - Leningrad Oblast > Saint Petersburg (0.04)
- Asia
  - Russia (0.04)
  - Malaysia > Kuala Lumpur
    - Kuala Lumpur (0.04)
  - Japan > Honshū
    - Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
  - China > Shanghai
    - Shanghai (0.04)
- Africa > South Africa
  - Gauteng > Johannesburg (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Neural Networks > Deep Learning (0.75)
    - Statistical Learning > Support Vector Machines (0.46)