Goto

Collaborating Authors

 Decision Tree Learning


Refining interaction search through signed iterative Random Forests

arXiv.org Machine Learning

Advances in supervised learning have enabled accurate prediction in biological systems governed by complex interactions among biomolecules. However, state-of-the-art predictive algorithms are typically black-boxes, learning statistical interactions that are difficult to translate into testable hypotheses. The iterative Random Forest algorithm took a step towards bridging this gap by providing a computationally tractable procedure to identify the stable, high-order feature interactions that drive the predictive accuracy of Random Forests (RF). Here we refine the interactions identified by iRF to explicitly map responses as a function of interacting features. Our method, signed iRF, describes subsets of rules that frequently occur on RF decision paths. We refer to these rule subsets as signed interactions. Signed interactions share not only the same set of interacting features but also exhibit similar thresholding behavior, and thus describe a consistent functional relationship between interacting features and responses. We describe stable and predictive importance metrics to rank signed interactions. For each SPIM, we define null importance metrics that characterize its expected behavior under known structure. We evaluate our proposed approach in biologically inspired simulations and two case studies: predicting enhancer activity and spatial gene expression patterns. In the case of enhancer activity, s-iRF recovers one of the few experimentally validated high-order interactions and suggests novel enhancer elements where this interaction may be active. In the case of spatial gene expression patterns, s-iRF recovers all 11 reported links in the gap gene network. By refining the process of interaction recovery, our approach has the potential to guide mechanistic inquiry into systems whose scale and complexity is beyond human comprehension.


Constructing classification trees using column generation

arXiv.org Machine Learning

In classification problems, the goal is to decide the class membership of a set of observations, by using available information on features and class membership of a training data set. Decision trees are one of the most popular models for solving this problem, due to their effectiveness and high interpretability. In this work, we focus on constructing univariate binary decision trees of prespecified depth. In a univariate binary decision tree, each internal node contains a test regarding the value of one single feature of the data set, while the leaves contain the target classes. The problem of constructing (learning) a classification tree (CTCP), is the problem of finding a set of optimal tests (decision checks), such that the assignment of target classes to rows satisfies a certain criteria. A commonly encountered objective is accuracy, measured as the number of correct predictions in a training set. As the problem of learning optimal decision trees is an NPcomplete problem (Hyafil and Rivest 1976), heuristics such as CART (Breiman et al. 1984) and ID3 (Quinlan 1986) are widely used. These greedy algorithms build a tree recursively, starting from a single node.


Random Forests and the Bias-Variance Tradeoff โ€“ Towards Data Science

#artificialintelligence

The Random Forest is an extremely popular machine learning algorithm. Often, with not too much pre-processing, one can throw together a quick and dirty model with no hyperparameter tuning and acheive results that aren't awful. As an example, I put together a RandomForestRegressor in Python using scikit-learn for the New York City Taxi Fare Prediction playground competition on Kaggle recently, passing in no arguments to the model constructor and using 1/100 for the training data (554238 of 55M rows), for a validation Rยฒ of 0.8. NOTE: This snippet assumes you split the data into training and validation sets with your features and target variable separated. You can see the full code on my GitHub profile.


A gentle introduction to decision trees using R

#artificialintelligence

Most techniques of predictive analytics have their origins in probability or statistical theory (see my post on Naรฏve Bayes, for example). In this post I'll look at one that has more a commonplace origin: the way in which humans make decisions. When making decisions, we typically identify the options available and then evaluate them based on criteria that are important to us. The intuitive appeal of such a procedure is in no small measure due to the fact that it can be easily explained through a visual. The tree structure depicted here provides a neat, easy-to-follow description of the issue under consideration and its resolution.



Offline Multi-Action Policy Learning: Generalization and Optimization

arXiv.org Machine Learning

In many settings, a decision-maker wishes to learn a rule, or policy, that maps from observable characteristics of an individual to an action. Examples include selecting offers, prices, advertisements, or emails to send to consumers, as well as the problem of determining which medication to prescribe to a patient. While there is a growing body of literature devoted to this problem, most existing results are focused on the case where data comes from a randomized experiment, and further, there are only two possible actions, such as giving a drug to a patient or not. In this paper, we study the offline multi-action policy learning problem with observational data and where the policy may need to respect budget constraints or belong to a restricted policy class such as decision trees. We build on the theory of efficient semi-parametric inference in order to propose and implement a policy learning algorithm that achieves asymptotically minimax-optimal regret. To the best of our knowledge, this is the first result of this type in the multi-action setup, and it provides a substantial performance improvement over the existing learning algorithms. We then consider additional computational challenges that arise in implementing our method for the case where the policy is restricted to take the form of a decision tree. We propose two different approaches, one using a mixed integer program formulation and the other using a tree-search based algorithm.


Equality Constrained Decision Trees: For the Algorithmic Enforcement of Group Fairness

arXiv.org Artificial Intelligence

Fairness, through its many forms and definitions, has become an important issue facing the machine learning community. In this work, we consider how to incorporate group fairness constraints in kernel regression methods. More specifically, we focus on examining the incorporation of these constraints in decision tree regression when cast as a form of kernel regression, with direct applications to random forests and boosted trees amongst other widespread popular inference techniques. We show that order of complexity of memory and computation is preserved for such models and bounds the expected perturbations to the model in terms of the number of leaves of the trees. Importantly, the approach works on trained models and hence can be easily applied to models in current use.


Mixed-Integer Convex Nonlinear Optimization with Gradient-Boosted Trees Embedded

arXiv.org Artificial Intelligence

Decision trees usefully represent sparse, high dimensional and noisy data. Having learned a function from this data, we may want to thereafter integrate the function into a larger decision-making problem, e.g., for picking the best chemical process catalyst. We study a large-scale, industrially-relevant mixed-integer nonlinear nonconvex optimization problem involving both gradient-boosted trees and penalty functions mitigating risk. This mixed-integer optimization problem with convex penalty terms broadly applies to optimizing pre-trained regression tree models. Decision makers may wish to optimize discrete models to repurpose legacy predictive models, or they may wish to optimize a discrete model that particularly well-represents a data set. We develop several heuristic methods to find feasible solutions, and an exact, branch-and-bound algorithm leveraging structural properties of the gradient-boosted trees and penalty functions. We computationally test our methods on concrete mixture design instance and a chemical catalysis industrial instance.


Is Machine Learning Analytics or AI? - International Institute for Analytics

#artificialintelligence

One of the definitional debates that bedevils the artificial intelligence (AI) field is whether machine learning is an AI-based method or technology. Or is it just an analytics-based activity? After all, it is statistical in nature, and attempts--as virtually analytical methods do--to fit a line or curve to a set of data points. And what difference does it make? Basic machine learning is practically indistinguishable from predictive analytics.


Machine Learning Suites for Online Toxicity Detection

arXiv.org Machine Learning

To identify and classify toxic online commentary, the modern tools of data science transform raw text into key features from which either thresholding or learning algorithms can make predictions for monitoring offensive conversations. We systematically evaluate 62 classifiers representing 19 major algorithmic families against features extracted from the Jigsaw dataset of Wikipedia comments. We compare the classifiers based on statistically significant differences in accuracy and relative execution time. Among these classifiers for identifying toxic comments, tree-based algorithms provide the most transparently explainable rules and rank-order the predictive contribution of each feature. Among 28 features of syntax, sentiment, emotion and outlier word dictionaries, a simple bad word list proves most predictive of offensive commentary. Introduction In 2015, the Twitter CEO, Dick Costello, took personal responsibility for online harassment, trolling and abuse on the Twitter platform.