Interpretable deep learning is a fundamental building block towards safer AI, especially when the deployment possibilities of deep learning-based computer-aided medical diagnostic systems are so eminent. However, without a computational formulation of black-box interpretation, general interpretability research rely heavily on subjective bias. Clear decision structure of the medical diagnostics lets us approximate the decision process of a radiologist as a model - removed from subjective bias. We define the process of interpretation as a finite communication between a known model and a black-box model to optimally map the black box's decision process in the known model. Consequently, we define interpretability as maximal information gain over the initial uncertainty about the black-box's decision within finite communication. We relax this definition based on the observation that diagnostic interpretation is typically achieved by a process of minimal querying. We derive an algorithm to calculate diagnostic interpretability. The usual question of accuracy-interpretability tradeoff, i.e. whether a black-box model's prediction accuracy is dependent on its ability to be interpreted by a known source model, does not arise in this theory. With multiple example simulation experiments of various complexity levels, we demonstrate the working of such a theoretical model in synthetic supervised classification scenarios.
Any sufficiently advanced technology is indistinguishable from magic. In the world of artificial intelligence & machine learning (AI & ML), black- and white-box categorization of models and algorithms refers to their interpretability. That is, given a model trained to map data inputs to outputs (e.g. And just as the software testing dichotomy is high-level behavior vs low-level logic, only white-box AI methods can be readily interpreted to see the logic behind models' predictions. In recent years with machine learning taking over new industries and applications, where the number of users far outnumber experts that grok the models and algorithms, the conversation around interpretability has become an important one.
Update: I have since refined these ideas in The Mythos of Model Interpretability, an academic paper presented at the 2016 ICML Workshop on Human Interpretability of Machine Learning. Neural networks, on the other hand, are black boxes. By this, it's suggested that we can pass input in, and observe what comes out, but we lack the ability to reason about what happened in the middle. To confirm the prevalence of this narrative, I ran a Google search for "neural network black box", yielding 2,410,000 results. By comparison, "logistic regression black box" turns up 600,000 results.
Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data. In addition to using models for prediction, the ability to interpret what a model has learned is receiving an increasing amount of attention. However, this increased focus has led to considerable confusion about the notion of interpretability. In particular, it is unclear how the wide array of proposed interpretation methods are related, and what common concepts can be used to evaluate them. We aim to address these concerns by defining interpretability in the context of machine learning and introducing the Predictive, Descriptive, Relevant (PDR) framework for discussing interpretations. The PDR framework provides three overarching desiderata for evaluation: predictive accuracy, descriptive accuracy and relevancy, with relevancy judged relative to a human audience. Moreover, to help manage the deluge of interpretation methods, we introduce a categorization of existing techniques into model-based and post-hoc categories, with sub-groups including sparsity, modularity and simulatability. To demonstrate how practitioners can use the PDR framework to evaluate and understand interpretations, we provide numerous real-world examples. These examples highlight the often under-appreciated role played by human audiences in discussions of interpretability. Finally, based on our framework, we discuss limitations of existing methods and directions for future work. We hope that this work will provide a common vocabulary that will make it easier for both practitioners and researchers to discuss and choose from the full range of interpretation methods.
The ubiquity of machine learning based predictive models in modern society naturally leads people to ask how trustworthy those models are? In predictive modeling, it is quite common to induce a trade-off between accuracy and interpretability. For instance, doctors would like to know how effective some treatment will be for a patient or why the model suggested a particular medication for a patient exhibiting those symptoms? We acknowledge that the necessity for interpretability is a consequence of an incomplete formalisation of the problem, or more precisely of multiple meanings adhered to a particular concept. For certain problems, it is not enough to get the answer (what), the model also has to provide an explanation of how it came to that conclusion (why), because a correct prediction, only partially solves the original problem. In this article we extend existing categorisation of techniques to aid model interpretability and test this categorisation.