Collaborating Authors

Decision Tree Learning

Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

Heavy review Summary: The paper concerns multi-class problems with a large number of classes. It introduces a novel label tree classifier that learns and predicts in logarithmic time in the number of classes. Theoretical guarantees in terms of a boosting-like theorem have been proven. Moreover, not only node classifiers, but also the structure of the tree is trained online. Additionally, the authors show a simple subtree swapping procedure that ensures proper balancing of the tree.

Logarithmic Time Online Multiclass prediction John Langford Courant Institute of Mathematical Sciences Microsoft Research New York, NY, USA

Neural Information Processing Systems

We study the problem of multiclass classification with an extremely large number of classes (k), with the goal of obtaining train and test time complexity logarithmic in the number of classes. We develop top-down tree construction approaches for constructing logarithmic depth trees. On the theoretical front, we formulate a new objective function, which is optimized at each node of the tree and creates dynamic partitions of the data which are both pure (in terms of class labels) and balanced. We demonstrate that under favorable conditions, we can construct logarithmic depth trees that have leaves with low label entropy. However, the objective function at the nodes is challenging to optimize computationally. We address the empirical problem with a new online decision tree construction procedure. Experiments demonstrate that this online algorithm quickly achieves improvement in test error compared to more common logarithmic training time approaches, which makes it a plausible method in computationally constrained large-k applications.

Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale

Neural Information Processing Systems

Deep distributed decision trees and tree ensembles have grown in importance due to the need to model increasingly large datasets.

Pruning Random Forests for Prediction on a Budget

Neural Information Processing Systems

We propose to prune a random forest (RF) for resource-constrained prediction. We first construct a RF and then prune it to optimize expected feature cost & accuracy. We pose pruning RFs as a novel 0-1 integer program with linear constraints that encourages feature re-use. We establish total unimodularity of the constraint set to prove that the corresponding LP relaxation solves the original integer program. We then exploit connections to combinatorial optimization and develop an efficient primal-dual algorithm, scalable to large datasets. In contrast to our bottom-up approach, which benefits from good RF initialization, conventional methods are top-down acquiring features based on their utility value and is generally intractable, requiring heuristics. Empirically, our pruning algorithm outperforms existing state-of-the-art resource-constrained algorithms.

A Communication-Efficient Parallel Algorithm for Decision Tree

Neural Information Processing Systems

Decision tree (and its extensions such as Gradient Boosting Decision Trees and Random Forest) is a widely used machine learning algorithm, due to its practical effectiveness and model interpretability. With the emergence of big data, there is an increasing need to parallelize the training process of decision tree. However, most existing attempts along this line suffer from high communication costs. In this paper, we propose a new algorithm, called Parallel Voting Decision Tree (PV-Tree), to tackle this challenge. After partitioning the training data onto a number of (e.g., M) machines, this algorithm performs both local voting and global voting in each iteration.

Reviews: Universal consistency and minimax rates for online Mondrian Forests

Neural Information Processing Systems

Summary: This paper proposes a modification of Mondorian Forest which is a variant of Random Forest, a majority vote of decision trees. The authors show that the modified algorithm has the consistency property while the original algorithm does not have one. In particular, when the conditional probability function is Lipschitz, the proposed algorithm achieves the minimax error rate, where the lower bound is previously known. Comments: The technical contribution is to refine the original version of the Mondorian Forest and prove its consistency. The theoretical results are nice and solid. The main idea comes from the original algorithm, thus the originality of the paper is a bit incremental.

Reviews: Label Distribution Learning Forests

Neural Information Processing Systems

The authors describe a method for label distribution learning based on differentiable decision trees. The authors use differentiable sigmoid units to estimate a label distribution using leaf nodes of trees. Learning in split nodes is done via backprop. The authors compare their work with relevant methods on learning label distributions and show the competitiveness of their method. I think this is a good paper, providing a sound methodology for learning LD.

Label Distribution Learning Forests

Neural Information Processing Systems

Label distribution learning (LDL) is a general learning framework, which assigns to an instance a distribution over a set of labels rather than a single label or multiple labels. Current LDL methods have either restricted assumptions on the expression form of the label distribution or limitations in representation learning, e.g., to learn deep features in an end-to-end manner. This paper presents label distribution learning forests (LDLFs) - a novel label distribution learning algorithm based on differentiable decision trees, which have several advantages: 1) Decision trees have the potential to model any general form of label distributions by a mixture of leaf node predictions.

LightGBM: A Highly Efficient Gradient Boosting Decision Tree

Neural Information Processing Systems

Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm, and has quite a few effective implementations such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, the efficiency and scalability are still unsatisfactory when the feature dimension is high and data size is large. A major reason is that for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously), to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much).