We propose Deep Hierarchical Machine (DHM), a model inspired from the divide-and-conquer strategy while emphasizing representation learning ability and flexibility. A stochastic routing framework as used by recent deep neural decision/regression forests is incorporated, but we remove the need to evaluate unnecessary computation paths by utilizing a different topology and introducing a probabilistic pruning technique. We also show a specified version of DHM (DSHM) for efficiency, which inherits the sparse feature extraction process as in traditional decision tree with pixel-difference feature. To achieve sparse feature extraction, we propose to utilize sparse convolution operation in DSHM and show one possibility of introducing sparse convolution kernels by using local binary convolution layer. DHM can be applied to both classification and regression problems, and we validate it on standard image classification and face alignment tasks to show its advantages over past architectures.
Abstract--Deep Neural Networks have achieved huge success at a wide spectrum of applications from language modeling, computer vision to speech recognition. However, nowadays, good performance alone is not sufficient to satisfy the needs of practical deployment where interpretability is demanded for cases involving ethicsand mission critical applications. The complex models of Deep Neural Networks make it hard to understand and reason the predictions, which hinders its further progress. To tackle this problem, we apply the Knowledge Distillation technique to distill Deep Neural Networks into decision trees in order to attain good performance and interpretability simultaneously. We formulate the problem at hand as a multi-output regression problem and the experiments demonstrate that the student model achieves significantly better accuracy performance (about 1% to 5%) than vanilla decision trees at the same level of tree depth. The experiments are implemented on the TensorFlow platform to make it scalable to big datasets. To the best of our knowledge, we are the first to distill Deep Neural Networks into vanilla decision trees on multi-class datasets. I. INTRODUCTION Despite Deep Neural Networks' (DNN) superior discrimination powerin many fields, the logics of each hidden feature representations before the output layer still remain to be blackbox. Understandingwhy a specific prediction is made is of utmost importance for the end-users to trust and adopt the model, and for the system designers to refine the model by performing feature engineering and parameter tuning.
In this work, we propose a novel node splitting method for regression trees and incorporate it into the regression forest framework. Unlike traditional binary splitting, where the splitting rule is selected from a predefined set of binary splitting rules via trial-and-error, the proposed node splitting method first finds clusters of the training data which at least locally minimize the empirical loss without considering the input space. Then splitting rules which preserve the found clusters as much as possible are determined by casting the problem into a classification problem. Consequently, our new node splitting method enjoys more freedom in choosing the splitting rules, resulting in more efficient tree structures. In addition to the Euclidean target space, we present a variant which can naturally deal with a circular target space by the proper use of circular statistics. We apply the regression forest employing our node splitting to head pose estimation (Euclidean target space) and car direction estimation (circular target space) and demonstrate that the proposed method significantly outperforms state-of-the-art methods (38.5% and 22.5% error reduction respectively).
We propose and systematically evaluate three strategies for training dynamically-routed artificial neural networks: graphs of learned transformations through which different input signals may take different paths. Though some approaches have advantages over others, the resulting networks are often qualitatively similar. We find that, in dynamically-routed networks trained to classify images, layers and branches become specialized to process distinct categories of images. Additionally, given a fixed computational budget, dynamically-routed networks tend to perform better than comparable statically-routed networks.