Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation
Understanding and interpreting the internal representations of large language models (LLMs) remains an open challenge. Patchscopes introduced a method for probing internal activations by patching them into new prompts, prompting models to self-explain their hidden representations. We introduce Superscopes, a technique that systematically amplifies superposed features in multilayer perceptron (MLP) outputs and hidden states before patching them into new contexts. Inspired by the "features as directions" perspective and the Classifier-Free Guidance (CFG) approach from diffusion models, Superscopes amplifies weak but meaningful features, enabling the interpretation of internal representations that previous methods failed to explain, all without requiring additional training. This approach provides new insights into how LLMs build context and represent complex concepts, further advancing mechanistic interpretability.
arXiv.org Artificial Intelligence
Mar-9-2025
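
To make the amplify-then-patch idea concrete, here is a minimal sketch in the spirit of the abstract, assuming a HuggingFace GPT-2 model and hook-based activation patching. The prompts, layer indices, amplification factor `ALPHA`, and the simple scalar scaling scheme are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # illustrative choice; the paper's target model may differ
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

SOURCE_PROMPT = "The Statue of Liberty stands in New York"
# Patchscopes-style few-shot "describe the token" target prompt ending in a
# placeholder token "x" (the prompt wording here is an assumption).
TARGET_PROMPT = ("Syria: a country in the Middle East. "
                 "Leonardo DiCaprio: an American actor. x")
SOURCE_LAYER = 6   # layer to read the hidden state from (assumption)
TARGET_LAYER = 6   # layer to patch into (assumption)
ALPHA = 2.0        # amplification factor applied before patching

# 1. Capture the residual-stream hidden state of the last source token.
src = tok(SOURCE_PROMPT, return_tensors="pt")
with torch.no_grad():
    out = model(**src, output_hidden_states=True)
src_vec = out.hidden_states[SOURCE_LAYER][0, -1]  # shape: (hidden_dim,)

# 2. Amplify the (possibly weak, superposed) representation. Plain scalar
#    scaling is one simple amplification scheme; the paper may combine the
#    amplified vector with the target context differently.
patched_vec = ALPHA * src_vec

# 3. Patch it into the target prompt at the placeholder position via a hook.
tgt = tok(TARGET_PROMPT, return_tensors="pt")
patch_pos = tgt.input_ids.shape[1] - 1  # final token acts as the placeholder

def patch_hook(module, args, output):
    hs = output[0] if isinstance(output, tuple) else output
    # Only patch the initial full-prompt pass; later decoding steps with a
    # KV cache process one token at a time and are left untouched.
    if hs.shape[1] > patch_pos:
        hs = hs.clone()
        hs[0, patch_pos] = patched_vec
        return (hs,) + output[1:] if isinstance(output, tuple) else hs
    return output

handle = model.transformer.h[TARGET_LAYER].register_forward_hook(patch_hook)
try:
    gen = model.generate(**tgt, max_new_tokens=10, do_sample=False,
                         pad_token_id=tok.eos_token_id)
finally:
    handle.remove()

# The continuation is the model's "self-explanation" of the patched vector.
print(tok.decode(gen[0, tgt.input_ids.shape[1]:]))
```

Setting `ALPHA = 1.0` recovers plain Patchscopes-style patching; values above 1 scale up the feature direction before the model is asked to explain it, which is the intuition the abstract borrows from CFG.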