Decision Tree Learning
Fast Sparse Decision Tree Optimization via Reference Ensembles
McTavish, Hayden, Zhong, Chudi, Achermann, Reto, Karimalis, Ilias, Chen, Jacques, Rudin, Cynthia, Seltzer, Margo
Sparse decision tree optimization has been one of the most fundamental problems in AI since its inception and is a challenge at the core of interpretable machine learning. Sparse decision tree optimization is computationally hard, and despite steady effort since the 1960's, breakthroughs have only been made on the problem within the past few years, primarily on the problem of finding optimal sparse decision trees. However, current state-of-the-art algorithms often require impractical amounts of computation time and memory to find optimal or near-optimal trees for some real-world datasets, particularly those having several continuous-valued features. Given that the search spaces of these decision tree optimization problems are massive, can we practically hope to find a sparse decision tree that competes in accuracy with a black box machine learning model? We address this problem via smart guessing strategies that can be applied to any optimal branch-and-bound-based decision tree algorithm. We show that by using these guesses, we can reduce the run time by multiple orders of magnitude, while providing bounds on how far the resulting trees can deviate from the black box's accuracy and expressive power. Our approach enables guesses about how to bin continuous features, the size of the tree, and lower bounds on the error for the optimal decision tree. Our experiments show that in many cases we can rapidly construct sparse decision trees that match the accuracy of black box models. To summarize: when you are having trouble optimizing, just guess.
Understanding Tree Models
Originally published on Towards AI the World's Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses. Life is full of decisions and eventually, we do measure which option to take on some logical-based analysis.
DECISION TREE IN A NUTSHELL
When a bank considers whether it would offer a loan to someone or not, it considers a chronological list of questions to decide if it's safe to approve such a loan. The questions under consideration could begin with simple ones such as what's the individual's annual income. Based on the answers, the next set of questions could involve finding out if the person has any existing loans, has defaulted on credit card payments, etc. Assuming the person draws a salary of $30,000, has no existing loans or criminal record, and makes his credit card payments on time, the bank may offer him the loan. You can call this a basic form of a decision tree.
Visualizing Ensemble Predictions of Music Mood
Music mood classification has been a challenging problem in comparison with some other classification problems (e.g., genre, composer, or period). One solution for addressing this challenging is to use an of ensemble machine learning models. In this paper, we show that visualization techniques can effectively convey the popular prediction as well as uncertainty at different music sections along the temporal axis, while enabling the analysis of individual ML models in conjunction with their application to different musical data. In addition to the traditional visual designs, such as stacked line graph, ThemeRiver, and pixel-based visualization, we introduced a new variant of ThemeRiver, called "dual-flux ThemeRiver", which allows viewers to observe and measure the most popular prediction more easily than stacked line graph and ThemeRiver. Testing indicates that visualizing ensemble predictions is helpful both in model-development workflows and for annotating music using model predictions.
Machine Learning-based Prediction of Porosity for Concrete Containing Supplementary Cementitious Materials
Porosity has been identified as the key indicator of the durability properties of concrete exposed to aggressive environments. This paper applies ensemble learning to predict porosity of high-performance concrete containing supplementary cementitious materials. The concrete samples utilized in this study are characterized by eight composition features including w/b ratio, binder content, fly ash, GGBS, superplasticizer, coarse/fine aggregate ratio, curing condition and curing days. The assembled database consists of 240 data records, featuring 74 unique concrete mixture designs. The proposed machine learning algorithms are trained on 180 observations (75%) chosen randomly from the data set and then tested on the remaining 60 observations (25%). The numerical experiments suggest that the regression tree ensembles can accurately predict the porosity of concrete from its mixture compositions. Gradient boosting trees generally outperforms random forests in terms of prediction accuracy. For random forests, the out-of-bag error based hyperparameter tuning strategy is found to be much more efficient than k-Fold Cross-Validation.
Confidence intervals for the random forest generalization error
How confident can we be in the generalization capacity of a predictive model? Of the many devices discussed in the statistical learning literature [1, 2, 3], a simple random split of the original data into training and test sets, and methods of folded cross-validation, stand out as the most common tools used to tackle the generalization issue. Availability of point estimates for the generalization error given by these procedures naturally raises the question of how to quantify the uncertainty involved in these estimates spending a manageable computational cost. Random forests [4] elegantly provide an alternative low cost (almost free) point estimate of the generalization error without requiring splittings of the data, and avoiding the computational burden of retraining the predictive model several times. The bagging mechanism [5] used to construct the ensemble of trees implies that each training data point is not used (stays "out-of-bag") when growing approximately 36.8% of the trees in the forest. This property gives us the so called out-of-bag estimate of the random forest generalization error: for each observation, using a suitable loss function, we compute the predictive error made by the random subforest whose trees didn't include the observation under consideration in its training process; the out-of-bag estimate is the average of these prediction errors over the whole training sample. 1
Learning Optimal Decision Sets and Lists with SAT
Yu, Jinqiang, Ignatiev, Alexey, Stuckey, Peter J., Le Bodic, Pierre
Decision sets and decision lists are two of the most easily explainable machine learning models. Given the renewed emphasis on explainable machine learning decisions, both of these machine learning models are becoming increasingly attractive, as they combine small size and clear explainability. In this paper, we define size as the total number of literals in the SAT encoding of these rule-based models as opposed to earlier work that concentrates on the number of rules. In this paper, we develop approaches to computing minimum-size "perfect" decision sets and decision lists, which are perfectly accurate on the training data, and minimal in size, making use of modern SAT solving technology. We also provide a new method for determining optimal sparse alternatives, which trade off size and accuracy. The experiments in this paper demonstrate that the optimal decision sets computed by the SAT-based approach are comparable with the best heuristic methods, but much more succinct, and thus, more explainable. We contrast the size and test accuracy of optimal decisions lists versus optimal decision sets, as well as other state-of-the-art methods for determining optimal decision lists. Finally, we examine the size of average explanations generated by decision sets and decision lists.
Shrub Ensembles for Online Classification
Buschjäger, Sebastian, Hess, Sibylle, Morik, Katharina
Online learning algorithms have become a ubiquitous tool in the machine learning toolbox and are frequently used in small, resource-constraint environments. Among the most successful online learning methods are Decision Tree (DT) ensembles. DT ensembles provide excellent performance while adapting to changes in the data, but they are not resource efficient. Incremental tree learners keep adding new nodes to the tree but never remove old ones increasing the memory consumption over time. Gradient-based tree learning, on the other hand, requires the computation of gradients over the entire tree which is costly for even moderately sized trees. In this paper, we propose a novel memory-efficient online classification ensemble called shrub ensembles for resource-constraint systems. Our algorithm trains small to medium-sized decision trees on small windows and uses stochastic proximal gradient descent to learn the ensemble weights of these `shrubs'. We provide a theoretical analysis of our algorithm and include an extensive discussion on the behavior of our approach in the online setting. In a series of 2~959 experiments on 12 different datasets, we compare our method against 8 state-of-the-art methods. Our Shrub Ensembles retain an excellent performance even when only little memory is available. We show that SE offers a better accuracy-memory trade-off in 7 of 12 cases, while having a statistically significant better performance than most other methods. Our implementation is available under https://github.com/sbuschjaeger/se-online .
BANKNOTE AUTHENTICATION USING RANDOM FOREST -- WITH SOURCE CODE -- EASY PROJECT
In today's blog, we will see that how we can perform Bank Note Authentication or how we can classify Bank Notes into fake or authentic classes based on numeric features like variance, skewness, kurtosis, entropy. This is going to be a very short blog, so without any further due. To explore more Machine Learning, Deep Learning, Computer Vision, NLP, Flask Projects visit my blog.