Toy Models of Superposition
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, Christopher Olah
arXiv.org Artificial Intelligence
Neural networks often pack many unrelated concepts into a single neuron, a puzzling phenomenon known as "polysemanticity" that makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
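The abstract's idea of storing more sparse features than dimensions can be sketched numerically. The snippet below is a minimal NumPy illustration, not the paper's exact experimental setup: it assumes a reconstruction model of the form x' = ReLU(WᵀWx + b) and embeds five sparse features into two dimensions by arranging the feature directions as a regular pentagon, one of the uniform-polytope geometries the abstract alludes to. A negative bias suppresses the interference that sharing dimensions creates.

```python
import numpy as np

# Hypothetical sketch: n = 5 features stored in m = 2 dimensions,
# with feature directions arranged as a regular pentagon (unit columns of W).
n, m = 5, 2
angles = 2 * np.pi * np.arange(n) / n
W = np.stack([np.cos(angles), np.sin(angles)])  # shape (m, n)

# Reconstruction: x' = ReLU(W^T W x + b). The off-diagonal entries of
# W^T W (cos 72° ≈ 0.31, cos 144° ≈ -0.81) are the interference terms;
# a bias of -0.32 pushes them below zero so ReLU removes them.
b = -0.32

def reconstruct(x):
    return np.maximum(W.T @ W @ x + b, 0.0)

# With sparse (one-hot) inputs, each feature is recovered up to scale:
for i in range(n):
    r = reconstruct(np.eye(n)[i])
    assert r.argmax() == i and np.count_nonzero(r) == 1
```

Under this arrangement more features than dimensions are represented with no interference on sparse inputs, which is the sense in which superposition makes individual directions "polysemantic."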
Sep 21, 2022