The Logical Implication Steering Method for Conditional Interventions on Transformer Generation

Feb-5-2025–arXiv.org Artificial Intelligence

The field of mechanistic interpretability in pre-trained transformer models has demonstrated substantial evidence supporting the ''linear representation hypothesis'', which is the idea that high level concepts are encoded as vectors in the space of activations of a model. Studies also show that model generation behavior can be steered toward a given concept by adding the concept's vector to the corresponding activations. We show how to leverage these properties to build a form of logical implication into models, enabling transparent and interpretable adjustments that induce a chosen generation behavior in response to the presence of any given concept. Our method, Logical Implication Model Steering (LIMS), unlocks new hand engineered reasoning capabilities by integrating neuro-symbolic logic into pre-trained transformer models.

conditional intervention, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

Feb-5-2025

arXiv.org PDF

Add feedback

Country:
- South America > Colombia
  - Meta Department > Villavicencio (0.04)
- North America > United States
  - Georgia > Fulton County > Atlanta (0.04)
- Europe
  - Italy > Calabria
    - Catanzaro Province > Catanzaro (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
- Asia > Thailand
  - Bangkok > Bangkok (0.04)

Genre:
- Research Report > New Finding (0.88)

Industry:
- Information Technology (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language (1.00)
  - Cognitive Science > Problem Solving (0.86)
  - Machine Learning > Performance Analysis
    - Accuracy (0.46)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found