The Logical Implication Steering Method for Conditional Interventions on Transformer Generation
arXiv.org Artificial Intelligence
The field of mechanistic interpretability in pre-trained transformer models has produced substantial evidence for the "linear representation hypothesis": the idea that high-level concepts are encoded as vectors in a model's activation space. Studies also show that a model's generation behavior can be steered toward a given concept by adding the concept's vector to the corresponding activations. We show how to leverage these properties to build a form of logical implication into models, enabling transparent and interpretable adjustments that induce a chosen generation behavior in response to the presence of any given concept. Our method, Logical Implication Model Steering (LIMS), unlocks new hand-engineered reasoning capabilities by integrating neuro-symbolic logic into pre-trained transformer models.
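The two ideas in the abstract (concepts as activation vectors, and steering by vector addition) can be combined into a toy "if concept P is present, induce behavior Q" rule. The sketch below is only an illustration under stated assumptions, not the paper's LIMS implementation: the function name `conditional_steer`, the dot-product threshold used to detect P, the scale `alpha`, and the 4-dimensional toy vectors are all hypothetical, since the abstract does not specify how the concept is detected or how the steering vector is applied.

```python
import numpy as np


def conditional_steer(h, p_vec, q_vec, threshold=0.5, alpha=1.0):
    """Toy implication P -> Q on a single activation vector.

    Assumption (not from the paper): concept P counts as "present" when
    the projection of h onto the unit-normalized p_vec exceeds `threshold`.
    When it does, alpha * q_vec is added to h to steer toward behavior Q.
    """
    p_unit = p_vec / np.linalg.norm(p_vec)
    score = float(h @ p_unit)           # how strongly h expresses concept P
    fired = score > threshold           # the "if P" part of the implication
    steered = h + alpha * q_vec if fired else h.copy()
    return steered, fired


# Hypothetical 4-dimensional activation space for illustration.
p_vec = np.array([1.0, 0.0, 0.0, 0.0])        # direction standing in for concept P
q_vec = np.array([0.0, 0.0, 2.0, 0.0])        # direction standing in for behavior Q
h_with_p = np.array([3.0, 1.0, 0.0, 0.0])     # activation that expresses P
h_without_p = np.array([-1.0, 1.0, 0.0, 0.0])  # activation that does not

steered_a, fired_a = conditional_steer(h_with_p, p_vec, q_vec)
steered_b, fired_b = conditional_steer(h_without_p, p_vec, q_vec)
```

In this toy setup, only the activation that projects strongly onto `p_vec` is shifted along `q_vec`; the other is returned unchanged. In a real transformer, such a rule would presumably be applied to hidden states at a chosen layer, which this sketch does not attempt.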
Feb-5-2025
- Country:
  - Asia > Thailand
  - Europe
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Italy > Calabria
      - Catanzaro Province > Catanzaro (0.04)
  - North America > United States
    - Georgia > Fulton County > Atlanta (0.04)
  - South America > Colombia
    - Meta Department > Villavicencio (0.04)
- Genre:
  - Research Report > New Finding (0.88)
- Industry:
  - Information Technology (0.46)
- Technology: