Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution

Shichang Zhang, Tessa Han, Usha Bhalla, Himabindu Lakkaraju

arXiv.org Artificial Intelligence 

As AI systems grow increasingly complex, understanding their behavior remains a critical challenge [Arrieta et al., 2020, Longo et al., 2024]. Researchers have developed methods to explain AI systems by attributing their behavior to three distinct aspects: input features, training data, and internal model components. Feature attribution methods identify the influence of input features at test time, revealing which aspects of the input drive the model's output [Zeiler and Fergus, 2014, Ribeiro et al., 2016, Horel and Giesecke, 2020, 2022, Lundberg and Lee, 2017, Smilkov et al., 2017]. Data attribution methods analyze how training data shape model behavior during the training phase [Koh and Liang, 2017, Ghorbani and Zou, 2019, Ilyas et al., 2022]. Component attribution examines the internal workings of the model by analyzing how specific components, such as neurons or layers in a neural network (NN), affect model behavior [Vig et al., 2020, Meng et al., 2022, Nanda, 2023, Shah et al., 2024]. While numerous attribution methods have been developed for each of these three aspects, and surveys of each exist [Guidotti et al., 2018, Covert et al., 2021, Wang et al., 2024, Hammoudeh and Lowd, 2024, Bereska and Gavves, 2024], these methods have been studied and used largely independently by different communities, creating a fragmented landscape of methods and terminology for similar ideas [Saphra and Wiegreffe, 2024]. Our position is that feature, data, and component attribution methods can be bridged to advance not only interpretability research, by stimulating cross-aspect knowledge transfer, but also broader AI research, including model editing, steering, and regulation. We show that these three types of attribution rely on common methods and differ primarily in perspective rather than in core techniques.
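To make this common ground concrete, consider the following minimal sketch (an illustrative formulation with assumed notation, not one quoted from the paper). All three attribution problems can be phrased as scoring the elements of a set by their marginal effect on a model behavior:

\[
\tau(i) \;=\; g\bigl(f;\, S\bigr) \;-\; g\bigl(f;\, S \setminus \{i\}\bigr), \qquad i \in S,
\]

where $f$ is the model, $g$ is a behavior of interest (e.g., the prediction on a test input or a loss), and $S$ is the set of input features (feature attribution), training examples (data attribution), or internal components such as neurons or layers (component attribution). Shapley-style methods average this marginal contribution over subsets of $S$, while gradient-based methods approximate it with a first-order expansion; the same machinery reappears across all three aspects, differing mainly in what the set $S$ contains.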