Decision Tree Learning
ABtree: An Algorithm for Subgroup-Based Treatment Assignment
Given two possible treatments, there may exist subgroups that benefit more from one treatment than the other. This problem is relevant to the field of marketing, where treatments may correspond to different ways of selling a product. It is similarly relevant to the field of public policy, where treatments may correspond to specific government programs. And finally, personalized medicine is a field wholly devoted to understanding which subgroups of individuals will benefit from particular medical treatments. We present a computationally fast tree-based method, ABtree, for treatment effect differentiation. Unlike other methods, ABtree specifically produces decision rules for optimal treatment assignment on a per-individual basis. The treatment choices are selected to maximize the overall occurrence of a desired binary outcome, conditional on a set of covariates. In this poster, we present the methodology for tree growth and pruning, and show performance results when applied to simulated data as well as real data.
Context-dependent feature analysis with random forests
Sutera, Antonio, Louppe, Gilles, Huynh-Thu, Vân Anh, Wehenkel, Louis, Geurts, Pierre
In many cases, feature selection is more complicated than identifying a single subset of input variables that would together explain the output. There may be interactions that depend on contextual information, i.e., variables that turn out to be relevant only in some specific circumstances. In this setting, the contribution of this paper is to extend the random forest variable importances framework in order (i) to identify variables whose relevance is context-dependent and (ii) to characterize as precisely as possible the effect of contextual information on these variables. The usage and the relevance of our framework for highlighting context-dependent variables are illustrated on both artificial and real datasets.
Random forests for survival analysis using maximally selected rank statistics
Wright, Marvin N., Dankowski, Theresa, Ziegler, Andreas
The most popular approach for analyzing survival data is the Cox regression model. The Cox model may, however, be misspecified, and its proportionality assumption is not always fulfilled. An alternative approach is random forests for survival outcomes. The standard split criterion for random survival forests is the log-rank test statistic, which favors splitting variables with many possible split points. Conditional inference forests avoid this split point selection bias. However, linear rank statistics are utilized in current software for conditional inference forests to select the optimal splitting variable, which cannot detect non-linear effects in the independent variables. We therefore use maximally selected rank statistics for split point selection in random forests for survival analysis. As in conditional inference forests, p-values for association between split points and survival time are minimized. We describe several p-value approximations and the implementation of the proposed random forest approach. A simulation study demonstrates that unbiased split point selection is possible. However, there is a trade-off between unbiased split point selection and runtime. In benchmark studies of prediction performance on simulated and real datasets, the new method performs better than random survival forests if informative dichotomous variables are combined with uninformative variables with more categories, and better than conditional inference forests if non-linear covariate effects are included. In a runtime comparison, the method proves to be computationally faster than both alternatives if a simple p-value approximation is used.
Improving performance of random forests for a particular value of outcome by adding chosen features
Choosing features to improve the performance of a particular algorithm is a difficult problem. Currently there is PCA, which can be used out of the box but is difficult to understand, requires centering and scaling of features, and is not easy to interpret. In addition, it does not allow prediction performance to be improved for a particular outcome (for example, one whose accuracy is lower than that of the others, or one of particular importance). My method makes it possible to use features without preprocessing, so the resulting prediction is easy to explain.
A Selection of Giant Radio Sources from NVSS
Results of the application of pattern recognition techniques to the problem of identifying Giant Radio Sources (GRS) from the data in the NVSS catalog are presented and issues affecting the process are explored. Decision-tree pattern recognition software was applied to training set source pairs developed from known NVSS large angular size radio galaxies. The full training set consisted of 51,195 source pairs, 48 of which were known GRS for which each lobe was primarily represented by a single catalog component. The source pairs had a maximum separation of 20 arc minutes and a minimum component area of 1.87 square arc minutes at the 1.4 mJy level. The importance of comparing resulting probability distributions of the training and application sets for cases of unknown class ratio is demonstrated. The probability of correctly ranking a randomly selected (GRS, non-GRS) pair from the best of the tested classifiers was determined to be 97.8 +/- 1.5%. The best classifiers were applied to the over 870,000 candidate pairs from the entire catalog. Images of higher ranked sources were visually screened and a table of over sixteen hundred candidates, including morphological annotation, is presented. These systems include doubles and triples, Wide-Angle Tail (WAT) and Narrow-Angle Tail (NAT), S- or Z-shaped systems, and core-jets and resolved cores. While some resolved lobe systems are recovered with this technique, generally it is expected that such systems would require a different approach.
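The quantity reported above, the probability of correctly ranking a randomly selected (GRS, non-GRS) pair, is the standard pairwise-ranking definition of the area under the ROC curve. The following is a minimal sketch of that definition only, not the authors' software; the function name and the example scores are hypothetical, and ties are assumed to count as half a correct ranking.

```python
from itertools import product


def pairwise_ranking_probability(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive
    example receives the higher classifier score.

    Ties count as half a correct ranking (an assumption; this is the
    usual convention for the area under the ROC curve).
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")

    correct = 0.0
    for pos, neg in product(pos_scores, neg_scores):
        if pos > neg:
            correct += 1.0
        elif pos == neg:
            correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores, for illustration only.
grs_scores = [0.9, 0.8, 0.4]
non_grs_scores = [0.7, 0.3, 0.2, 0.1]
auc = pairwise_ranking_probability(grs_scores, non_grs_scores)
print(f"pairwise ranking probability: {auc:.3f}")
```

The exhaustive pair loop is quadratic in the number of examples, which is acceptable for a sketch but would be replaced by a rank-based computation on a training set of the size described above.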
The 7 Best Data Science and Machine Learning Podcasts -- The Startup
Data science and machine learning have long been interests of mine, but now that I'm working on Fuzzy.io I need to keep on top of all the news in both fields. My preferred way to do this is through listening to podcasts. I've listened to a bunch of machine learning and data science podcasts in the last few months, so I thought I'd share my favorites: Every other week, they release a 10–15 minute episode where hosts Kyle and Linda Polich give a short primer on topics like k-means clustering, natural language processing and decision tree learning, often using analogies related to their pet parrot, Yoshi. This is the only place where you'll learn about k-means clustering via placement of parrot droppings.
Learning Decision Trees from Histogram Data Using Multiple Subsets of Bins
Gurung, Ram B. (Stockholm University) | Lindgren, Tony (Stockholm University) | Boström, Henrik (Stockholm University)
The standard approach of learning decision trees from histogram data is to treat the bins as independent variables. However, as the underlying dependencies among the bins might not be completely exploited by this approach, an algorithm has been proposed for learning decision trees from histogram data by considering all bins simultaneously while partitioning examples at each node of the tree. Although the algorithm has been demonstrated to improve predictive performance, its computational complexity has turned out to be a major bottleneck, in particular for histograms with a large number of bins. In this paper, we propose instead a sliding window approach to select subsets of the bins to be considered simultaneously while partitioning examples. This significantly reduces the number of possible splits to consider, allowing for substantially larger histograms to be handled. We also propose to evaluate the original bins independently, in addition to evaluating the subsets of bins when performing splits. This ensures that the information obtained by treating bins simultaneously is an additional gain compared to what is considered by the standard approach. Results of experiments on applying the new algorithm to both synthetic and real world datasets demonstrate positive results in terms of predictive performance without excessive computational cost.
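As a rough illustration of the sliding window idea, the sketch below enumerates the candidate bin subsets a node might evaluate: every individual bin, plus every window of adjacent bins. The abstract does not specify the window definition or how splits on a subset are scored, so the function names, the `window_size` and `step` parameters, and the contiguous-window assumption are all hypothetical.

```python
def sliding_bin_windows(n_bins, window_size, step=1):
    """Return contiguous windows of bin indices, each of length window_size.

    Assumes a window is a run of adjacent bins (a hypothetical reading
    of the sliding window approach).
    """
    if not 1 <= window_size <= n_bins:
        raise ValueError("window_size must be between 1 and n_bins")
    if step < 1:
        raise ValueError("step must be at least 1")
    return [
        tuple(range(start, start + window_size))
        for start in range(0, n_bins - window_size + 1, step)
    ]


def candidate_bin_subsets(n_bins, window_size):
    """Combine single bins (the standard approach) with sliding windows."""
    singles = [(i,) for i in range(n_bins)]
    windows = sliding_bin_windows(n_bins, window_size) if window_size > 1 else []
    return singles + windows


# A histogram with 5 bins and windows of 3 adjacent bins.
subsets = candidate_bin_subsets(n_bins=5, window_size=3)
print(len(subsets), "candidate subsets:", subsets)
```

Under these assumptions the number of candidates grows linearly with the number of bins, whereas considering all bins simultaneously at every node does not scale to large histograms.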
Multiplicative Factorization of Multi-Valued NIN-AND Tree Models
Xiang, Yang (University of Guelph) | Jin, Yiting (University of Guelph)
A multi-valued Non-Impeding Noisy-AND (NIN-AND) tree model has linear complexity and is more expressive than common Causal Independence Models (CIMs). We formulate a Multiplicative Factorization (MF) for multi-valued NIN-AND Tree (NAT) models. In comparison with the MF for binary NAT models (of an undirected tree structure), the proposed MF is a hybrid and multiply connected graphical model. Although a NAT is made of two types of NIN-AND gates, we show that a sound and space-efficient MF requires multiple types of gate MFs, and therefore significantly more sophisticated parameterization and integration of gate MFs, and soundness analysis. We show that the formulated MF is exact and its space complexity is linear in the number $n$ of causes per effect. Based on the proposed MF, we extend the scheme for lazy propagation (LP) with binary NAT-modeled Bayesian Networks (BNs) to multi-valued NAT-modeled BNs. We show that the extended scheme is more powerful than LP based on MF of noisy-MAX. We demonstrate that the scheme allows significantly more efficient LP both in space and in time.
Something is wrong in the way #MachineLearning is being taught to #Developers
The last few years have seen an explosion of interest in Machine Learning (ML) technology and potential applications. Machine Learning is the unsung hero that powers many applications, systems, sensors, devices, and products. Today, Machine Learning is so pervasive that we can often assume its presence in most of the applications and systems without having to specifically call it out. In simple terms, machine learning is a computer's ability to learn from data, and it is one of the most useful tools we have to develop intelligent systems and applications. Machine learning is used widely today for all kinds of tasks, from churn prediction in large companies, to web search, to medical diagnostics, to robotics.