Training Neural Networks for Modularity aids Interpretability
Satvik Golechha, Dylan Cope, Nandi Schoots
–arXiv.org Artificial Intelligence
One approach to improving network interpretability is clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We find that pretrained models are highly unclusterable, and therefore train models to be more modular using an ``enmeshment loss'' function that encourages the formation of non-interacting clusters. Using automated interpretability measures, we show that our method finds clusters that learn different, disjoint, and smaller circuits for CIFAR-10 labels. Our approach provides a promising direction for making neural networks easier to interpret.
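The abstract does not specify the form of the enmeshment loss, but the stated goal (encouraging non-interacting clusters) can be sketched as a penalty on weights that connect neurons assigned to different clusters. The function below is a hypothetical illustration with assumed names (`enmeshment_loss`, `out_clusters`, `in_clusters`), not the paper's actual formulation.

```python
import numpy as np

def enmeshment_loss(W, out_clusters, in_clusters):
    """Hypothetical sketch of a loss encouraging non-interacting clusters.

    W: (n_out, n_in) weight matrix of one layer.
    out_clusters / in_clusters: integer cluster label per output / input neuron.
    Penalizes the squared magnitude of weights that cross cluster boundaries,
    so minimizing it drives cross-cluster connections toward zero.
    """
    # cross_mask[i, j] = 1 where output neuron i and input neuron j
    # belong to different clusters, 0 within the same cluster
    cross_mask = (out_clusters[:, None] != in_clusters[None, :]).astype(W.dtype)
    return float(np.sum((W * cross_mask) ** 2))

# Example: a block-diagonal weight matrix incurs no penalty,
# while dense cross-cluster weights are penalized.
labels = np.array([0, 0, 1, 1])
block_diag = np.eye(4)          # weights stay within clusters {0,1} and {2,3}
dense = np.ones((4, 4))         # every neuron connects to every other
```

In practice such a term would be added to the task loss during training, so the network trades off accuracy against cross-cluster connectivity.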
Sep-24-2024