
Collaborating Authors

Chanin, David


A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

arXiv.org Artificial Intelligence

Sparse Autoencoders (SAEs) have emerged as a promising approach to decompose the activations of Large Language Models (LLMs) into human-interpretable latents. In this paper, we pose two questions. First, to what extent do SAEs extract monosemantic and interpretable latents? Second, to what extent does varying the sparsity or the size of the SAE affect monosemanticity and interpretability? By investigating these questions in the context of a simple first-letter identification task where we have complete access to ground truth labels for all tokens in the vocabulary, we are able to provide more detail than prior investigations. Critically, we identify a problematic form of feature splitting we call feature absorption, where seemingly monosemantic latents fail to fire in cases where they clearly should. Our investigation suggests that varying SAE size or sparsity is insufficient to solve this issue, and that there are deeper conceptual issues in need of resolution.
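The core mechanism the abstract describes can be illustrated with a toy SAE: a ReLU encoder that produces sparse nonnegative latents and a linear decoder that reconstructs the activation. This is a minimal sketch with illustrative dimensions and random weights, not the paper's trained models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: a 16-dim "model activation" decomposed
# into 64 SAE latents (real SAEs are far larger).
d_model, d_sae = 16, 64
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))

def encode(x):
    # ReLU yields a sparse, nonnegative latent code
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(z):
    # Linear reconstruction of the original activation
    return z @ W_dec

x = rng.normal(size=d_model)
z = encode(x)
x_hat = decode(z)

# Training an SAE trades off these two quantities: how few latents
# fire (L0 sparsity) versus how well the activation is reconstructed.
l0 = int((z > 0).sum())
mse = float(((x - x_hat) ** 2).mean())
```

Feature absorption would show up here as a latent that appears to track a concept (e.g. "starts with the letter A") yet stays at zero on some tokens where the concept clearly holds.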


Analyzing the Generalization and Reliability of Steering Vectors -- ICML 2024

arXiv.org Artificial Intelligence

Steering vectors (SVs) are a new approach to efficiently adjust language model behaviour at inference time by intervening on intermediate model activations. They have shown promise in terms of improving both capabilities and model alignment. However, the reliability and generalisation properties of this approach are unknown. In this work, we rigorously investigate these properties, and show that steering vectors have substantial limitations both in- and out-of-distribution. In-distribution, steerability is highly variable across different inputs. Depending on the concept, spurious biases can substantially contribute to how effective steering is for each input, presenting a challenge for the widespread use of steering vectors. Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt and fail to generalise. Overall, our findings show that while steering can work well in the right circumstances, there remain many technical difficulties in applying steering vectors to guide models' behaviour at scale.
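The intervention itself is simple to sketch: a steering vector is commonly derived as the difference of mean activations on contrastive prompts, then added (scaled) to hidden states at one layer during inference. The arrays below are random stand-ins for real model activations; shapes and the scale `alpha` are illustrative assumptions.

```python
import numpy as np

# Stand-in for one layer's hidden states: (num_tokens, d_model)
hidden = np.random.default_rng(1).normal(size=(4, 8))

# A common recipe: mean activation on "positive" prompts minus mean
# activation on "negative" prompts for the target concept.
pos_acts = np.random.default_rng(2).normal(loc=0.5, size=(10, 8))
neg_acts = np.random.default_rng(3).normal(loc=-0.5, size=(10, 8))
steering_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(h, v, alpha=2.0):
    """Add a scaled steering vector to every token's activation."""
    return h + alpha * v

steered = steer(hidden, steering_vec)
```

The variability the abstract reports concerns this exact operation: the same `steering_vec` can shift behaviour strongly on one input and barely at all on another.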


Identifying Linear Relational Concepts in Large Language Models

arXiv.org Artificial Intelligence

Transformer language models (LMs) have been shown to represent concepts as directions in the latent space of hidden activations. However, for any given human-interpretable concept, how can we find its direction in the latent space? We present a technique called linear relational concepts (LRC) for finding concept directions corresponding to human-interpretable concepts at a given hidden layer in a transformer LM by first modeling the relation between subject and object as a linear relational embedding (LRE). While prior LRE work was mainly presented as an exercise in understanding model representations, we find that inverting the LRE while using earlier object layers results in a powerful technique to find concept directions that both work well as classifiers and causally influence model outputs.
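The key step can be sketched as linear algebra: an LRE approximates the model's subject-to-object map as an affine function o ≈ Ws + b, and inverting it for a target object representation recovers a direction in subject space. This is a minimal sketch with random matrices in place of fitted LRE weights and real hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative hidden dimension

# LRE: the relation (e.g. "capital of") approximated as o = W s + b,
# where s is a subject representation and o an object representation.
W = rng.normal(size=(d, d))
b = rng.normal(size=d)

# Target object representation (e.g. hidden state for "Paris").
o_target = rng.normal(size=d)

# Inverting the LRE: solve W s + b = o_target for s, then normalise.
concept_dir = np.linalg.solve(W, o_target - b)
concept_dir /= np.linalg.norm(concept_dir)

def concept_score(s):
    """Dot-product classifier along the recovered concept direction."""
    return float(s @ concept_dir)
```

In practice the inversion uses a (pseudo)inverse of the fitted LRE Jacobian rather than an exact solve, but the principle, mapping an object representation back into subject space, is the same.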


Open-source Frame Semantic Parsing

arXiv.org Artificial Intelligence

Frame semantic parsing (Gildea and Jurafsky, 2002) is a natural language understanding (NLU) task involving finding structured semantic frames and their arguments in natural language text as formalized by the FrameNet project (Baker et al., 1998). Frame semantics has proved useful in understanding user intent from text, finding use in modern voice assistants (Chen et al., 2019), dialog systems (Chen et al., 2013), and even text analysis (Zhao et al., 2023). A semantic frame in FrameNet describes an event, relation, or situation and its participants. When a frame occurs in a sentence, there is typically a "trigger" word in the sentence which is said to evoke the frame. In addition, a frame contains a list of arguments known as frame elements which describe the semantic roles that pertain to the frame. A sample sentence parsed for frame and frame elements is shown in Figure 1. FrameNet provides a list of lexical units (LUs) for each frame, which are word senses that may evoke the frame when they occur in a sentence. For instance, the frame "Attack" has lexical units such as "ambush.n".
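The structure described above (a frame, its trigger span, and its frame elements) can be captured in a small data model. This is an illustrative sketch, not the FrameNet API; class and field names are assumptions, and the example annotation mirrors the "Attack" frame mentioned in the abstract.

```python
from dataclasses import dataclass, field

@dataclass
class FrameElement:
    name: str    # semantic role, e.g. "Assailant"
    span: tuple  # (start, end) character offsets in the sentence

@dataclass
class FrameAnnotation:
    frame: str      # FrameNet frame name, e.g. "Attack"
    trigger: tuple  # span of the evoking word
    elements: list = field(default_factory=list)

sent = "The rebels ambushed the convoy at dawn."
parse = FrameAnnotation(
    frame="Attack",
    trigger=(11, 19),  # "ambushed" evokes the frame
    elements=[
        FrameElement("Assailant", (0, 10)),  # "The rebels"
        FrameElement("Victim", (20, 30)),    # "the convoy"
        FrameElement("Time", (31, 38)),      # "at dawn"
    ],
)
```

A parser for this task must jointly identify the trigger, disambiguate which frame it evokes, and label the argument spans, which is why the outputs are grouped into one annotation object.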


Neuro-symbolic Commonsense Social Reasoning

arXiv.org Artificial Intelligence

Social norms underlie all human social interactions, yet formalizing and reasoning with them remains a major challenge for AI systems. We present a novel system for taking social rules of thumb (ROTs) in natural language from the Social Chemistry 101 dataset and converting them to first-order logic, where reasoning is performed using a neuro-symbolic theorem prover. We accomplish this in several steps. First, ROTs are converted into Abstract Meaning Representation (AMR), which is a graphical representation of the concepts in a sentence, and the AMR is aligned with RoBERTa embeddings. We then generate alternate simplified versions of the AMR via a novel algorithm, recombining and merging embeddings for added robustness against different wordings of text and incorrect AMR parses. The AMR is then converted into first-order logic and is queried with a neuro-symbolic theorem prover. The goal of this paper is to develop and evaluate a neuro-symbolic method which performs explicit reasoning about social situations in a logical form.
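The final step of the pipeline, querying a first-order rule derived from a ROT, can be sketched with a tiny backward-chaining prover. This is a purely symbolic stand-in for the neuro-symbolic prover (no embedding similarity), and the rule, predicates, and facts are invented for illustration.

```python
# A ROT like "It is rude to talk loudly in a library" rendered as a
# first-order rule: rude(X) :- talks_loudly(X), in_library(X).
rules = [
    # (head, body): the head holds if every body atom can be proven
    (("rude", "?x"), [("talks_loudly", "?x"), ("in_library", "?x")]),
]
facts = {("talks_loudly", "alice"), ("in_library", "alice")}

def subst(atom, binding):
    """Replace bound ?variables in an atom with their values."""
    return tuple(binding.get(t, t) for t in atom)

def prove(goal, binding=None):
    """Naive backward chaining over ground facts and one-level rules."""
    binding = binding or {}
    g = subst(goal, binding)
    if g in facts:
        return True
    for head, body in rules:
        # Match the goal against the rule head, binding ?variables.
        b = dict(binding)
        ok = True
        for h, t in zip(head, g):
            if h.startswith("?"):
                b[h] = t
            elif h != t:
                ok = False
                break
        if ok and all(prove(atom, b) for atom in body):
            return True
    return False
```

In the paper's system the hard fact-matching step is softened: atoms are matched via RoBERTa embedding similarity rather than exact symbol equality, which is what makes the prover robust to different wordings.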