Explaining Neural Networks without Access to Training Data

Marton, Sascha, Lüdtke, Stefan, Bartelt, Christian, Tschalzev, Andrej, Stuckenschmidt, Heiner

arXiv.org Artificial Intelligence 

Artificial neural networks achieve impressive results for various modeling tasks [LeCun et al., 2015, Wang et al., 2020]. However, a downside of their superior performance and sophisticated structure is the limited comprehensibility of the learned models. In many domains, it is crucial to understand the function learned by a neural network, especially when it comes to decisions that affect people [Samek et al., 2019, Molnar, 2020]. A common approach to tackling the problem of interpretability without sacrificing predictive performance is to use a surrogate model as a gateway to interpretability [Molnar, 2020]. Most existing global surrogate approaches use a distillation procedure to learn the surrogate model from the predictions of the neural network [Molnar, 2020, Frosst and Hinton, 2017]: they query the neural network on a representative set of samples, and the resulting input-output pairs are then used to train the surrogate model. This representative set usually comprises the training data of the original model, or at least follows its distribution [Molnar, 2020, Lopes et al., 2017]. However, in many cases the training data cannot easily be exposed due to privacy or safety concerns [Lopes et al., 2017, Bhardwaj et al., 2019, Nayak et al., 2019].
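The distillation procedure described above can be sketched in a few lines. The following is a minimal, self-contained illustration, not the method of this paper: the "neural network" is replaced by a hypothetical fixed decision function that we may only query, the representative sample is drawn uniformly (standing in for the training-data distribution), and the surrogate is a depth-1 decision stump fitted by exhaustive search to maximize fidelity, i.e., agreement with the black box on the queried pairs. All names here are illustrative assumptions.

```python
import random

def black_box(x):
    # Stand-in for a trained neural network (hypothetical example):
    # a fixed decision function that can only be queried, not inspected.
    return 1 if 2.0 * x[0] - x[1] > 0.5 else 0

def sample_inputs(n, dim=2, seed=0):
    # Representative samples; in practice these would come from (or
    # follow the distribution of) the original training data.
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]

def fit_stump(X, y):
    # Surrogate model: a depth-1 decision stump chosen to maximize
    # fidelity (agreement with the black-box labels) on the query set.
    best = None
    for f in range(len(X[0])):
        for t in sorted(x[f] for x in X):
            for pos in (0, 1):
                pred = [pos if x[f] > t else 1 - pos for x in X]
                acc = sum(p == yi for p, yi in zip(pred, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, f, t, pos)
    return best  # (fidelity, feature index, threshold, label if above)

X = sample_inputs(500)
y = [black_box(x) for x in X]              # query the network
fidelity, feat, thr, pos = fit_stump(X, y)  # train the surrogate
print(f"surrogate: x[{feat}] > {thr:.2f} -> {pos}, fidelity = {fidelity:.2f}")
```

Because the true decision boundary is not axis-aligned, the stump only approximates the black box; richer surrogate classes (e.g., deeper trees) trade interpretability for fidelity. The key dependence this paper targets is visible in `sample_inputs`: without access to the training distribution, the query set may not be representative.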