interpretability method

Learning outside the Black-Box: The pursuit of interpretable models

Neural Information Processing Systems

Machine learning has proved its ability to produce accurate models, but the deployment of these models outside the machine learning community has been hindered by the difficulty of interpreting them.


FIND: A Function Description Benchmark for Evaluating Interpretability Methods

Neural Information Processing Systems

Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools?


Evaluating the Robustness of Interpretability Methods through Explanation Invariance and Equivariance

Neural Information Processing Systems

Interpretability methods are valuable only if their explanations faithfully describe the explained model. In this work, we consider neural networks whose predictions are invariant under a specific symmetry group. This includes popular architectures, ranging from convolutional to graph neural networks. Any explanation that faithfully explains this type of model needs to be in agreement with this invariance property.
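
The agreement requirement described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than something taken from the paper: the toy model (a sum of tanh units with one shared weight, which makes it invariant under permutations of its inputs), the gradient-times-input attribution, and all names and values. The check is that permuting the input leaves the prediction unchanged and permutes the attribution in the same way.

```python
import math

# Hypothetical shared weight. Sharing it across all inputs is what makes
# the toy model invariant under permutations of its inputs.
W = 0.7


def model(x):
    """Toy permutation-invariant model: sum of tanh(W * x_i)."""
    return sum(math.tanh(W * xi) for xi in x)


def gradient_times_input(x):
    """Analytic gradient-times-input attribution for the toy model.

    d/dx_i tanh(W * x_i) = W * (1 - tanh(W * x_i) ** 2)
    """
    return [xi * W * (1.0 - math.tanh(W * xi) ** 2) for xi in x]


def permute(values, perm):
    """Apply a permutation: output[i] = values[perm[i]]."""
    return [values[p] for p in perm]


def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two lists."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))


# Illustrative input and permutation (arbitrary values).
x = [0.5, -1.2, 2.0, 0.3, -0.7]
perm = [2, 0, 4, 1, 3]

# Model invariance: the prediction should not change under the permutation.
invariance_gap = abs(model(permute(x, perm)) - model(x))

# Explanation equivariance: explaining the permuted input should give the
# permuted explanation of the original input.
equivariance_gap = max_abs_diff(
    gradient_times_input(permute(x, perm)),
    permute(gradient_times_input(x), perm),
)

print("invariance gap:", invariance_gap)
print("equivariance gap:", equivariance_gap)
```

Both gaps should be zero up to floating-point rounding for this toy setup. An attribution method that produced a large equivariance gap on such a model would be disagreeing with a symmetry the model is known to have.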