Paper

Governance Architecture for Neural Network Superposition: A Structural Solution to Hallucination via Routing and Interference Filtering

Neural networks often pack many unrelated concepts into a single neuron, a puzzling phenomenon known as "polysemanticity" that makes interpretability much more challenging. This paper provides a toy model in which polysemanticity can be fully understood: it arises when models store additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
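To make the idea of superposition concrete, here is a minimal sketch, not the authors' code. It assumes a toy model of the form x' = ReLU(WᵀW x + b), which this page does not spell out, and it places three features in a two-dimensional hidden space along the directions of an equilateral triangle. The names `polygon_embedding` and `reconstruct` are hypothetical and used only for illustration.

```python
import math

def polygon_embedding(n_features=3):
    """Return n unit vectors evenly spaced on a circle (one 2-D direction per feature)."""
    return [
        (math.cos(2 * math.pi * i / n_features), math.sin(2 * math.pi * i / n_features))
        for i in range(n_features)
    ]

def reconstruct(x, directions, bias=0.0):
    """Compress x into 2 hidden dimensions, then read each feature back out with a ReLU."""
    # Hidden state: sum of the feature directions weighted by the feature values.
    hidden = [sum(x[i] * directions[i][d] for i in range(len(x))) for d in range(2)]
    # Readout: project the hidden state back onto each feature direction, then apply ReLU.
    return [
        max(0.0, sum(directions[i][d] * hidden[d] for d in range(2)) + bias)
        for i in range(len(directions))
    ]

directions = polygon_embedding(3)

# Interference between two distinct features is their dot product (about -0.5 here).
interference = sum(a * b for a, b in zip(directions[0], directions[1]))

sparse = reconstruct([1.0, 0.0, 0.0], directions)  # only one feature active
dense = reconstruct([1.0, 1.0, 0.0], directions)   # two features active at once
```

Under these assumptions, the sparse input is recovered cleanly because the ReLU clips the negative interference to zero, while the dense input comes back distorted. This is one way to see why sparsity matters for storing more features than there are dimensions.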

arXiv (Cornell University) · Published 2022-09-21

Authors: Elhage, Nelson · Hume, Tristan · Olsson, Catherine · Schiefer, Nicholas · Henighan, Tom · Kravec, Shauna · Hatfield-Dodds, Zac · Lasenby, Robert · Drain, Dawn · Chen, Carol · Grosse, Roger · McCandlish, Sam · Kaplan, Jared · Amodei, Dario · Wattenberg, Martin · Olah, Christopher
