Decision Tree Learning
Random forests with random projections of the output space for high dimensional multi-label classification
Joly, Arnaud, Geurts, Pierre, Wehenkel, Louis
We adapt the idea of random projections applied to the output space, so as to enhance tree-based ensemble methods in the context of multi-label classification. We show how learning time complexity can be reduced without affecting computational complexity and accuracy of predictions. We also show that random output space projections may be used in order to reach different bias-variance tradeoffs, over a broad panel of benchmark problems, and that this may lead to improved accuracy while reducing significantly the computational burden of the learning stage.
Interpreting Tree Ensembles with inTrees
Tree ensembles such as random forests and boosted trees are accurate but difficult to understand, debug and deploy. In this work, we provide the inTrees (interpretable trees) framework that extracts, measures, prunes and selects rules from a tree ensemble, and calculates frequent variable interactions. An rule-based learner, referred to as the simplified tree ensemble learner (STEL), can also be formed and used for future prediction. The inTrees framework can applied to both classification and regression problems, and is applicable to many types of tree ensembles, e.g., random forests, regularized random forests, and boosted trees. We implemented the inTrees algorithms in the "inTrees" R package.
BET: Bayesian Ensemble Trees for Clustering and Prediction in Heterogeneous Data
Duan, Leo L., Clancy, John P., Szczesniak, Rhonda D.
We propose a novel "tree-averaging" model that utilizes the ensemble of classification and regression trees (CART). Each constituent tree is estimated with a subset of similar data. We treat this grouping of subsets as Bayesian ensemble trees (BET) and model them as an infinite mixture Dirichlet process. We show that BET adapts to data heterogeneity and accurately estimates each component. Compared with the bootstrap-aggregating approach, BET shows improved prediction performance with fewer trees. We develop an efficient estimating procedure with improved sampling strategies in both CART and mixture models. We demonstrate these advantages of BET with simulations, classification of breast cancer and regression of lung function measurement of cystic fibrosis patients.
An Evasion and Counter-Evasion Study in Malicious Websites Detection
Xu, Li, Zhan, Zhenxin, Xu, Shouhuai, Ye, Keyin
Malicious websites are a major cyber attack vector, and effective detection of them is an important cyber defense task. The main defense paradigm in this regard is that the defender uses some kind of machine learning algorithms to train a detection model, which is then used to classify websites in question. Unlike other settings, the following issue is inherent to the problem of malicious websites detection: the attacker essentially has access to the same data that the defender uses to train its detection models. This 'symmetry' can be exploited by the attacker, at least in principle, to evade the defender's detection models. In this paper, we present a framework for characterizing the evasion and counter-evasion interactions between the attacker and the defender, where the attacker attempts to evade the defender's detection models by taking advantage of this symmetry. Within this framework, we show that an adaptive attacker can make malicious websites evade powerful detection models, but proactive training can be an effective counter-evasion defense mechanism. The framework is geared toward the popular detection model of decision tree, but can be adapted to accommodate other classifiers.
Decision Trees for Function Evaluation - Simultaneous Optimization of Worst and Expected Cost
Cicalese, Ferdinando, Laber, Eduardo, Saettler, Aline Medeiros
In several applications of automatic diagnosis and active learning a central problem is the evaluation of a discrete function by adaptively querying the values of its variables until the values read uniquely determine the value of the function. In general, the process of reading the value of a variable might involve some cost, computational or even a fee to be paid for the experiment required for obtaining the value. This cost should be taken into account when deciding the next variable to read. The goal is to design a strategy for evaluating the function incurring little cost (in the worst case or in expectation according to a prior distribution on the possible variables' assignments). Our algorithm builds a strategy (decision tree) which attains a logarithmic approxima- tion simultaneously for the expected and worst cost spent. This is best possible under the assumption that $P \neq NP.$
Recognizing Blind Spot Check Activity with Car Drivers Based on Decision Tree Classifier Approach
Kedowide, Colombiano (Tรฉlรฉ-Universitรฉ du Quรฉbec) | Gouin-Vallerand, Charles (Tรฉlรฉ-Universitรฉ du Quรฉbec) | Vallieres, Evelyne (Tรฉlรฉ-Universitรฉ du Quรฉbec)
Blind spot check is important driving activity that is a good indicator of driversโ proficiency and vigilance. By recognizing the blind spot check activity with drivers, it is possible to quantify and qualify the proficiency of the drivers, but also to cross validate this information with other data such the fatigue level. Thus, in this paper, we present a blind spot check activity recognition system where decision tree classifiers are modeled for each drivers and are used to automatically recognize the blind spot checks.
On Boosting Sparse Parities
Reyzin, Lev (University of Illinois at Chicago)
While boosting has been extensively studied, considerablyless attention has been devoted to the task of designing good weaklearning algorithms. In this paper we consider the problem of designing weak learners thatare especially adept to the boosting procedure and specifically the AdaBoost algorithm. First we describe conditions desirable for a weak learning algorithm. We then propose using sparse parity functions as weak learners, which have many of our desired properties, as weak learners in boosting. Our experimental tests show the proposed weak learners tobe competitive with the most widely used ones: decisionstumps and pruned decision trees.
Growing Regression Forests by Classification: Applications to Object Pose Estimation
In this work, we propose a novel node splitting method for regression trees and incorporate it into the regression forest framework. Unlike traditional binary splitting, where the splitting rule is selected from a predefined set of binary splitting rules via trial-and-error, the proposed node splitting method first finds clusters of the training data which at least locally minimize the empirical loss without considering the input space. Then splitting rules which preserve the found clusters as much as possible are determined by casting the problem into a classification problem. Consequently, our new node splitting method enjoys more freedom in choosing the splitting rules, resulting in more efficient tree structures. In addition to the Euclidean target space, we present a variant which can naturally deal with a circular target space by the proper use of circular statistics. We apply the regression forest employing our node splitting to head pose estimation (Euclidean target space) and car direction estimation (circular target space) and demonstrate that the proposed method significantly outperforms state-of-the-art methods (38.5% and 22.5% error reduction respectively).
Cross-Lingual Knowledge Validation Based Taxonomy Derivation from Heterogeneous Online Wikis
Wang, Zhigang (Tsinghua University) | Li, Juanzi (Tsinghua University) | Li, Shuangjie (Tsinghua University) | Li, Mingyang (Tsinghua University) | Tang, Jie (Tsinghua University) | Zhang, Kuo (Sogou Inc.) | Zhang, Kun (Sogou Inc.)
Creating knowledge bases based on the crowd-sourced wikis, like Wikipedia, has attracted significant research interest in the field of intelligent Web. However, the derived taxonomies usually contain many mistakenly imported taxonomic relations due to the difference between the user-generated subsumption relations and the semantic taxonomic relations. Current approaches to solving the problem still suffer the following issues: (i) the heuristic-based methods strongly rely on specific language dependent rules. (ii) the corpus-based methods depend on a large-scale high-quality corpus, which is often unavailable. In this paper, we formulate the cross-lingual taxonomy derivation problem as the problem of cross-lingual taxonomic relation prediction. We investigate different linguistic heuristics and language independent features, and propose a cross-lingual knowledge validation based dynamic adaptive boosting model to iteratively reinforce the performance of taxonomic relation prediction. The proposed approach successfully overcome the above issues, and experiments show that our approach significantly outperforms the designed state-of-the-art comparison methods.
Regression Trees for Longitudinal Data
Kundu, Madan Gopal, Harezlak, Jaroslaw
While studying response trajectory, often the population of interest may be diverse enough to exist distinct subgroups within it and the longitudinal change in response may not be uniform in these subgroups. That is, the timeslope and/or influence of covariates in longitudinal profile may vary among these different subgroups. For example, Raudenbush (2001) used depression as an example to argue that it is incorrect to assume that all the people in a given population would be experiencing either increasing or decreasing levels of depression. In such cases, traditional linear mixed effects model (assuming common parametric form for covariates and time) is not directly applicable for the entire population as a group-averaged trajectory can mask important subgroup differences. Our aim is to identify and characterize longitudinally homogeneous subgroups based on the combination of baseline covariates in the most parsimonious way. This goal can be achieved via constructing regression tree for longitudinal data using baseline covariates as partitioning variables. We have proposed LongCART algorithm to construct regression tree for the longitudinal data. In each node, the proposed LongCART algorithm determines the need for further splitting (i.e. whether parameter(s) of longitudinal profile is influenced by any baseline attributes) via parameter instability tests and thus the decision of further splitting is type-I error controlled. We have obtained the asymptotic results for the proposed instability test and examined finite sample behavior of the whole algorithm through simulation studies. Finally, we have applied the LongCART algorithm to study the longitudinal changes in choline level among HIV patients.