patchscope
Do Natural Language Descriptions of Model Activations Convey Privileged Information?
Li, Millicent, Arroyo, Alberto Mario Ceballos, Rogers, Giordano, Saphra, Naomi, Wallace, Byron C.
Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they can succeed at benchmarks without any access to target model internals, suggesting that these datasets may not be ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the knowledge of the target LLM whose activations are decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE (0.14)
- Oceania > New Zealand (0.04)
- (26 more...)
- Education (0.67)
- Leisure & Entertainment > Games > Computer Games (0.45)
Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation
Understanding and interpreting the internal representations of large language models (LLMs) remains an open challenge. Patchscopes introduced a method for probing internal activations by patching them into new prompts, prompting models to self-explain their hidden representations. We introduce Superscopes, a technique that systematically amplifies superposed features in MLP outputs (multilayer perceptron) and hidden states before patching them into new contexts. Inspired by the "features as directions" perspective and the Classifier-Free Guidance (CFG) approach from diffusion models, Superscopes amplifies weak but meaningful features, enabling the interpretation of internal representations that previous methods failed to explain-all without requiring additional training. This approach provides new insights into how LLMs build context and represent complex concepts, further advancing mechanistic interpretability.
- Europe > United Kingdom > Wales (0.05)
- Asia > Singapore (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- (4 more...)
- Media (0.46)
- Leisure & Entertainment (0.46)
Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
Biran, Eden, Gottesman, Daniela, Yang, Sohee, Geva, Mor, Globerson, Amir
Large language models (LLMs) can solve complex multi-step problems, but little is known about how these computations are implemented internally. Motivated by this, we study how LLMs answer multi-hop queries such as "The spouse of the performer of Imagine is". These queries require two information extraction steps: a latent one for resolving the first hop ("the performer of Imagine") into the bridge entity (John Lennon), and one for resolving the second hop ("the spouse of John Lennon") into the target entity (Yoko Ono). Understanding how the latent step is computed internally is key to understanding the overall computation. By carefully analyzing the internal computations of transformer-based LLMs, we discover that the bridge entity is resolved in the early layers of the model. Then, only after this resolution, the two-hop query is solved in the later layers. Because the second hop commences in later layers, there could be cases where these layers no longer encode the necessary knowledge for correctly predicting the answer. Motivated by this, we propose a novel "back-patching" analysis method whereby a hidden representation from a later layer is patched back to an earlier layer. We find that in up to 57% of previously incorrect cases there exists a back-patch that results in the correct generation of the answer, showing that the later layers indeed sometimes lack the needed functionality. Overall our methods and findings open further opportunities for understanding and improving latent reasoning in transformer-based LLMs.
- Asia > Singapore (0.05)
- North America > Dominican Republic (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (5 more...)
- Media (0.69)
- Leisure & Entertainment (0.55)
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Ghandeharioun, Asma, Caciularu, Avi, Pearce, Adam, Dixon, Lucas, Geva, Mor
Inspecting the information encoded in hidden representations of large language models (LLMs) can explain models' behavior and verify their alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer a wide range of questions about an LLM's computation. We show that prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as instances of this framework. Moreover, several of their shortcomings such as failure in inspecting early layers or lack of expressivity can be mitigated by Patchscopes. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities such as using a more capable model to explain the representations of a smaller model, and unlocks new applications such as self-correction in multi-hop reasoning.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > United Kingdom > Wales (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- (13 more...)
- Research Report > Experimental Study (0.46)
- Research Report > New Finding (0.46)
- Leisure & Entertainment (1.00)
- Government > Regional Government (0.68)
- Media > Film (0.67)