Decision Tree Learning
Machine Learning in Python has never been easier
At BigML we believe that over the next few years automated, data-driven decisions and data-driven applications are going to change the world. In fact, we think it will be the biggest shift in business efficiency since the dawn of the office calculator, when individuals had "Computer" listed as the title on their business cards. We want to help people rapidly and easily create predictive models using their datasets, no matter what size they are. Our easy-to-use, public API is a great step in that direction, but a few bindings for popular languages are obviously a big bonus. Thus, we are very happy to announce an open source Python binding to BigML.io, the BigML REST API. You can find it and fork it on GitHub.
Lost in a random forest: Using Big Data to study rare events
Sudden, broad-scale shifts in public opinion about social problems are relatively rare. Until recently, social scientists were forced to conduct post-hoc case studies of such unusual events that ignore the broader universe of possible shifts in public opinion that do not materialize. The vast amount of data that has recently become available via social media sites such as Facebook and Twitter, as well as the mass digitization of qualitative archives, provides an unprecedented opportunity for scholars to avoid such selection on the dependent variable. Yet the sheer scale of these new data creates a new set of methodological challenges. Conventional linear models, for example, minimize the influence of rare events as "outliers," especially within analyses of large samples.
How to Bin or Convert Numerical Variables to Categorical Variables with Decision Trees
Why would you want to convert a numerical variable into a categorical one? Depending on the situation, binning can make the numerical variable easier to interpret, give you a quick segmentation, or simply provide an additional feature for your predictive model. Binning is a popular feature engineering technique. Suppose your hypothesis is that the age of a customer is correlated with their tendency to interact with a mobile app. The age of the user is plotted on the x-axis and user interaction with the app is plotted on the y-axis.
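A minimal sketch of the idea, assuming a binary "interacted" label and using only the Python standard library: a shallow decision tree is grown on the single numerical variable, and its split thresholds become the bin edges. The data and the helper names (`best_split`, `tree_bin_edges`, `assign_bin`) are hypothetical and chosen for illustration; in practice a library tree such as scikit-learn's `DecisionTreeClassifier` would usually take the place of the hand-written split search.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(xs, ys):
    """Return (threshold, weighted_impurity) of the best binary split, or None."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate two equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = ((lo + hi) / 2, score)  # midpoint between neighbours
    return best


def tree_bin_edges(xs, ys, max_depth=2, min_node_size=2):
    """Grow a shallow tree on one variable and return its sorted thresholds."""
    if max_depth == 0 or len(xs) < min_node_size or gini(ys) == 0.0:
        return []
    split = best_split(xs, ys)
    if split is None:
        return []
    threshold = split[0]
    edges = [threshold]
    for keep_left in (True, False):
        part = [(x, y) for x, y in zip(xs, ys) if (x <= threshold) == keep_left]
        px = [x for x, _ in part]
        py = [y for _, y in part]
        edges += tree_bin_edges(px, py, max_depth - 1, min_node_size)
    return sorted(edges)


def assign_bin(x, edges):
    """Index of the bin that x falls into (0 .. len(edges))."""
    return sum(x > e for e in edges)


# Hypothetical data: age of each user and whether they interacted with the app.
ages = [18, 22, 25, 29, 33, 38, 44, 51, 58, 64, 70]
interacted = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1]

edges = tree_bin_edges(ages, interacted, max_depth=2)
print("bin edges:", edges)
print("bins:", [assign_bin(a, edges) for a in ages])
```

The depth of the tree controls how many bins are produced: a tree of depth d yields at most 2^d bins, so a small depth keeps the resulting categorical variable easy to interpret.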
Annotated Decision Trees for Simple Moral Machines
Bendel, Oliver (Northwestern Switzerland School of Business)
Autonomization often follows after the automization on which it is based. More and more machines have to make decisions with moral implications. Machine ethics, which can be seen as an equivalent of human ethics, analyses the chances and limits of moral machines. So far, decision trees have not been commonly used for modelling moral machines. This article proposes an approach for creating annotated decision trees, and specifies their central components. The focus is on simple moral machines. The chances of such models are illustrated with the example of a self-driving car that is friendly to humans and animals. Finally the advantages and disadvantages are discussed and conclusions are drawn.
Evaluation of Protein Structural Models Using Random Forests
Cao, Renzhi, Jo, Taeho, Cheng, Jianlin
Protein structure prediction has been a "grand challenge" problem in structural biology over the last few decades. Protein quality assessment plays a very important role in protein structure prediction. In this paper, we propose a new protein quality assessment method which can predict both local and global quality of protein 3D structural models. Our method uses both multi-model and single-model quality assessment methods for global quality assessment, and uses chemical, physical, and geometrical features, together with the global quality score, for local quality assessment. CASP9 targets are used to generate the features for local quality assessment. We evaluate the performance of our local quality assessment method on CASP10, where it is comparable with two state-of-the-art QA methods based on the average absolute distance between the real and predicted distance. In addition, we blindly tested our method on CASP11, and the good performance shows that combining single-model and multi-model quality assessment methods could be a good way to improve the accuracy of model quality assessment, and that the random forest technique could be used to train a good local quality assessment model.
MPBART - Multinomial Probit Bayesian Additive Regression Trees
Kindo, Bereket P., Wang, Hao, Peña, Edsel A.
The multinomial probit (MNP) model for discrete choice modeling is often used in economics, market research, political science and transportation. It models the choices made by agents given their demographic characteristics and/or the features of the K 2 available choice alternatives. Examples include the study of consumers' purchasing behavior (e.g., McCulloch et al. (2000); Imai and van Dyk (2005)); voting behavior in multi-party elections (e.g., Quinn et al. (1999)); and choice of different modes of transportation (e.g., Bolduc (1999)). Details of the MNP model in which choices depend on predictors in a linear fashion are studied in McFadden et al. (1973); McFadden (1989); Keane (1992); McCulloch and Rossi (1994); Nobile (1998); McCulloch et al. (2000); Imai and van Dyk (2005); Train (2009); Burgette and Nordheim (2012), among others. Among the widely used multinomial choice modeling procedures are the multinomial logit model (e.g., McFadden et al. (1973); Train (2009)) and the multinomial probit model (e.g., McFadden (1989); McCulloch and Rossi (1994); Imai and van Dyk (2005)). The former relies on the assumption that a choice outcome is independent of the removal (or introduction) of an irrelevant choice alternative, while the latter, including MPBART, does not make this restrictive assumption.
Finding structure in data using multivariate tree boosting
Miller, Patrick J., Lubke, Gitta H., McArtor, Daniel B., Bergeman, C. S.
Technology and collaboration enable dramatic increases in the size of psychological and psychiatric data collections, but finding structure in these large data sets with many collected variables is challenging. Decision tree ensembles like random forests (Strobl, Malley, and Tutz, 2009) are a useful tool for finding structure, but are difficult to interpret with multiple outcome variables which are often of interest in psychology. To find and interpret structure in data sets with multiple outcomes and many predictors (possibly exceeding the sample size), we introduce a multivariate extension to a decision tree ensemble method called Gradient Boosted Regression Trees (Friedman, 2001). Our method, multivariate tree boosting, can be used for identifying important predictors, detecting predictors with non-linear effects and interactions without specification of such effects, and for identifying predictors that cause two or more outcome variables to covary without parametric assumptions. We provide the R package 'mvtboost' to estimate, tune, and interpret the resulting model, which extends the implementation of univariate boosting in the R package 'gbm' (Ridgeway, 2013) to continuous, multivariate outcomes. To illustrate the approach, we analyze predictors of psychological well-being (Ryff and Keyes, 1995). Simulations verify that our approach identifies predictors with non-linear effects and achieves high prediction accuracy, exceeding or matching the performance of (penalized) multivariate multiple regression and multivariate decision trees over a wide range of conditions.
A Framework to Adjust Dependency Measure Estimates for Chance
Romano, Simone, Vinh, Nguyen Xuan, Bailey, James, Verspoor, Karin
Estimating the strength of dependency between two variables is fundamental for exploratory analysis and many other applications in data mining. For example, non-linear dependencies between two continuous variables can be explored with the Maximal Information Coefficient (MIC), and categorical variables that are dependent on the target class are selected using Gini gain in random forests. Nonetheless, because dependency measures are estimated on finite samples, the interpretability of their quantification and the accuracy when ranking dependencies become challenging. Dependency estimates are not equal to 0 when variables are independent, cannot be compared if computed on different sample sizes, and are inflated by chance on variables with more categories. In this paper, we propose a framework to adjust dependency measure estimates on finite samples. Our adjustments, which are simple and applicable to any dependency measure, are helpful in improving interpretability when quantifying dependency and in improving accuracy on the task of ranking dependencies. In particular, we demonstrate that our approach enhances the interpretability of MIC when used as a proxy for the amount of noise between variables, and that it improves accuracy when ranking variables during the splitting procedure in random forests.
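The inflation described in this abstract can be illustrated with a small sketch. It is an assumption-laden illustration, not the paper's estimator: the abstract does not spell out the adjustment, so the code below uses a simple permutation baseline, subtracting the mean Gini gain obtained after shuffling the target from the raw Gini gain. The function names (`gini_gain`, `chance_baseline`, `adjusted_gain`) and the synthetic data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(x, y):
    """Impurity of y minus the size-weighted impurity of y within each category of x."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    n = len(y)
    within = sum(len(g) / n * gini(g) for g in groups.values())
    return gini(y) - within


def chance_baseline(x, y, n_perm=200, rng=None):
    """Mean Gini gain after shuffling y, i.e. the gain expected under independence."""
    if rng is None:
        rng = random.Random(0)
    y_perm = list(y)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # breaks any real link between x and y
        total += gini_gain(x, y_perm)
    return total / n_perm


def adjusted_gain(x, y, n_perm=200, rng=None):
    """Raw Gini gain minus its permutation baseline (assumed adjustment)."""
    return gini_gain(x, y) - chance_baseline(x, y, n_perm, rng)


# Synthetic data: both predictors are independent of the binary target.
rng = random.Random(42)
n = 200
y = [rng.randint(0, 1) for _ in range(n)]
x_few = [rng.randint(0, 1) for _ in range(n)]    # 2 categories
x_many = [rng.randint(0, 19) for _ in range(n)]  # 20 categories

for name, x in (("2 categories", x_few), ("20 categories", x_many)):
    raw = gini_gain(x, y)
    adj = adjusted_gain(x, y, rng=rng)
    print(f"{name}: raw gain = {raw:.4f}, adjusted gain = {adj:.4f}")
```

Because neither predictor carries information about the target, any raw gain is due to chance alone, and the variable with more categories should tend to show the larger raw gain and the larger baseline. Subtracting the baseline puts the two variables on a comparable footing before ranking them.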
Directional Decision Lists
In this paper we introduce a novel family of decision lists consisting of highly interpretable models which can be learned efficiently in a greedy manner. The defining property is that all rules are oriented in the same direction. Particular examples of this family are decision lists with monotonically decreasing (or increasing) probabilities. On simulated data we empirically confirm that the proposed model family is easier to train than general decision lists. We exemplify the practical usability of our approach by identifying problem symptoms in a manufacturing process.