Decision Tree Learning
Fully-Dynamic Approximate Decision Trees With Worst-Case Update Time Guarantees
We give the first algorithm that maintains an approximate decision tree over an arbitrary sequence of insertions and deletions of labeled examples, with strong guarantees on the worst-case running time per update request. For instance, we show how to maintain a decision tree where every vertex has Gini gain within an additive $\alpha$ of the optimum by performing $O\Big(\frac{d\,(\log n)^4}{\alpha^3}\Big)$ elementary operations per update, where $d$ is the number of features and $n$ the maximum size of the active set (the net result of the update requests). We give similar bounds for the information gain and the variance gain. In fact, all these bounds are corollaries of a more general result, stated in terms of decision rules -- functions that, given a set $S$ of labeled examples, decide whether to split $S$ or predict a label. Decision rules give a unified view of greedy decision tree algorithms regardless of the example and label domains, and lead to a general notion of $\epsilon$-approximate decision trees that, for natural decision rules such as those used by ID3 or C4.5, implies the gain approximation guarantees above. The heart of our work provides a deterministic algorithm that, given any decision rule and any $\epsilon > 0$, maintains an $\epsilon$-approximate tree using $O\!\left(\frac{d\, f(n)}{n} \operatorname{poly}\frac{h}{\epsilon}\right)$ operations per update, where $f(n)$ is the complexity of evaluating the rule over a set of $n$ examples and $h$ is the maximum height of the maintained tree.
Gaussian Process-Gated Hierarchical Mixtures of Experts
Liu, Yuhao, Ajirak, Marzieh, Djuric, Petar
In this paper, we propose novel Gaussian process-gated hierarchical mixtures of experts (GPHMEs) that are used for building gates and experts. Unlike in other mixtures of experts where the gating models are linear to the input, the gating functions of our model are inner nodes built with Gaussian processes based on random features that are non-linear and non-parametric. Further, the experts are also built with Gaussian processes and provide predictions that depend on test data. The optimization of the GPHMEs is carried out by variational inference. There are several advantages of the proposed GPHMEs. One is that they outperform tree-based HME benchmarks that partition the data in the input space. Another advantage is that they achieve good performance with reduced complexity. A third advantage of the GPHMEs is that they provide interpretability of deep Gaussian processes and more generally of deep Bayesian neural networks. Our GPHMEs demonstrate excellent performance for large-scale data sets even with quite modest sizes.
Tree Learning: Optimal Algorithms and Sample Complexity
Avdiukhin, Dmitrii, Yaroslavtsev, Grigory, Vainstein, Danny, Fischer, Orr, Das, Sauman, Mirza, Faraz
The algorithmic problem of constructing hierarchical data representations has been of major importance for many decades, due to its applications in statistics [Ward Jr, 1963, Gower and Ross, 1969], entomology [Michener and Sokal, 1957], plant biology [Sorensen, 1948], genomics [Eisen et al., 1998] and other domains. Efficient collection of labeled data is a problem of key importance for construction of hierarchical data representations. In this paper we consider the problem of constructing tree representations of data from labeled samples, focusing on understanding the optimal number of samples required for this task. The most basic type of a label that allows one to construct a tree representation consists of a triplet of points (x, y, z) labeled according to the induced hierarchical structure within the triplet. For example, given images of a cat, a dog, and a plane, the label would describe the cat and the dog as being more similar to each other than to the plane. This "odd one out" type of label is the simplest one to collect in a crowdsourcing setting. In this paper, we focus on understanding the number of such labeled samples required in order to construct a tree representation of the underlying data, which enables one to accurately predict subsequent labels in the future.
Kinodynamic Rapidly-exploring Random Forest for Rearrangement-Based Nonprehensile Manipulation
Ren, Kejia, Chanrungmaneekul, Podshara, Kavraki, Lydia E., Hang, Kaiyu
Rearrangement-based nonprehensile manipulation still remains as a challenging problem due to the high-dimensional problem space and the complex physical uncertainties it entails. We formulate this class of problems as a coupled problem of local rearrangement and global action optimization by incorporating free-space transit motions between constrained rearranging actions. We propose a forest-based kinodynamic planning framework to concurrently search in multiple problem regions, so as to enable global exploration of the most task-relevant subspaces, while facilitating effective switches between local rearranging actions. By interleaving dynamic horizon planning and action execution, our framework can adaptively handle real-world uncertainties. With extensive experiments, we show that our framework significantly improves the planning efficiency and manipulation effectiveness while being robust against various uncertainties.
Decision trees compensate for model misspecification
Panton, Hugh, Leech, Gavin, Aitchison, Laurence
Boost (Chen and Guestrin, 2016) with default tree-depth 3 and default tree number 100, could be depicted in full: The best-performing models in ML are not interpretable. If we can explain why they outperform, we may be able to replicate these mechanisms and obtain both interpretability and performance. One example are decision trees and their descendent gradient boosting machines (GBMs). These perform well in the presence of complex interactions, with tree depth governing the order of interactions. However, interactions cannot fully account for the depth of trees found in practice. We confirm 5 alternative hypotheses about the role of tree depth in performance in the absence of true interactions, and present results from experiments on a battery of datasets. Part of the success of tree models is due to their robustness to various forms of mis-specification.
Fast Linear Model Trees by PILOT
Raymaekers, Jakob, Rousseeuw, Peter J., Verdonck, Tim, Yao, Ruicong
Linear model trees are regression trees that incorporate linear models in the leaf nodes. This preserves the intuitive interpretation of decision trees and at the same time enables them to better capture linear relationships, which is hard for standard decision trees. But most existing methods for fitting linear model trees are time consuming and therefore not scalable to large data sets. In addition, they are more prone to overfitting and extrapolation issues than standard regression trees. In this paper we introduce PILOT, a new algorithm for linear model trees that is fast, regularized, stable and interpretable. PILOT trains in a greedy fashion like classic regression trees, but incorporates an $L^2$ boosting approach and a model selection rule for fitting linear models in the nodes. The abbreviation PILOT stands for $PI$ecewise $L$inear $O$rganic $T$ree, where `organic' refers to the fact that no pruning is carried out. PILOT has the same low time and space complexity as CART without its pruning. An empirical study indicates that PILOT tends to outperform standard decision trees and other linear model trees on a variety of data sets. Moreover, we prove its consistency in an additive model setting under weak assumptions. When the data is generated by a linear model, the convergence rate is polynomial.
Demystifying the Random Forest. Deconstructing and Understanding this…
In classical Machine Learning, Random Forests have been a silver bullet type of model. In this post, I want to better understand the components that make up a Random Forest. To accomplish this, I am going to deconstruct the Random Forest into its most basic components and explain what is going on in each level of computation. By the end, we will have attained a much deeper understanding of how Random Forests work and how to work with them with more intuition. The examples we will use will be focused on classification, but many of the principles apply to the regression scenarios as well. Let's start by invoking a classic Random Forest pattern.
Data-driven Protection of Transformers, Phase Angle Regulators, and Transmission Lines in Interconnected Power Systems
This dissertation highlights the growing interest in and adoption of machine learning (ML) approaches for fault detection in modern power grids. Once a fault has occurred, it must be identified quickly and preventative steps must be taken to remove or insulate it. As a result, detecting, locating, and classifying faults early and accurately can improve safety and dependability while reducing downtime and hardware damage. ML-based solutions and tools to carry out effective data processing and analysis to aid power system operations and decision-making are becoming preeminent with better system condition awareness and data availability. Power transformers, Phase Shift Transformers or Phase Angle Regulators, and transmission lines are critical components in power systems, and ensuring their safety is a primary issue. Differential relays are commonly employed to protect transformers, whereas distance relays are utilized to protect transmission lines. Magnetizing inrush, overexcitation, and current transformer saturation make transformer protection a challenge. Furthermore, non-standard phase shift, series core saturation, low turn-to-turn, and turn-to-ground fault currents are non-traditional problems associated with Phase Angle Regulators. Faults during symmetrical power swings and unstable power swings may cause mal-operation of distance relays and unintentional and uncontrolled islanding. The distance relays also mal-operate for transmission lines connected to type-3 wind farms. The conventional protection techniques would no longer be adequate to address the above challenges due to limitations in handling and analyzing massive amounts of data, limited generalizability, incapability to model non-linear systems, etc. These limitations of differential and distance protection methods bring forward the motivation of using ML in addressing various protection challenges.
A Cloud-Based Energy Management Strategy for Hybrid Electric City Bus Considering Real-Time Passenger Load Prediction
Shi, Junzhe, Xu, Bin, Zhou, Xingyu, Hou, Jun
Electric city bus gains popularity in recent years for its low greenhouse gas emission, low noise level, etc. Different from a passenger car, the weight of a city bus varies significantly with different amounts of onboard passengers. After analyzing the importance of battery aging and passenger load effects on an optimal energy management strategy, this study introduces the passenger load prediction into the hybrid-electric city buses energy management problem, which is not well studied in the existing literature. The average model, Decision Tree, Gradient Boost Decision Tree, and Neural Networks models are compared in the passenger load prediction. The Gradient Boost Decision Tree model is selected due to its best accuracy and high stability. Given the predicted passenger load, a dynamic programming algorithm determines the optimal power demand for supercapacitor and battery by optimizing the battery aging and energy usage leveraging cloud techniques. Then, rule extraction is conducted on dynamic programming results, and the rule is real-time loaded to the vehicle onboard controller to handle prediction errors and uncertainties. The proposed cloud-based Dynamic Programming and rule extraction framework with the passenger load prediction show 4% and 11% lower bus operating costs in off-peak and peak hours, respectively. The operating cost by the proposed framework is less than 1% of the dynamic programming with the true passenger load information.
Analyzing Tree Architectures in Ensembles via Neural Tangent Kernel
Kanoh, Ryuichi, Sugiyama, Mahito
A soft tree is an actively studied variant of a decision tree that updates splitting rules using the gradient method. Although soft trees can take various architectures, their impact is not theoretically well known. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that only the number of leaves at each depth is relevant for the tree architecture in ensemble learning with an infinite number of trees. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees like decision lists does not degenerate when they get infinitely deep. This is in contrast to the perfect binary trees, whose NTK is known to degenerate and leads to worse generalization performance for deeper trees.