 Decision Tree Learning


MurTree: Optimal Classification Trees via Dynamic Programming and Search

arXiv.org Artificial Intelligence

Decision tree learning is a widely used approach in machine learning, favoured in applications that require concise and interpretable models. Heuristic methods are traditionally used to quickly produce models with reasonably high accuracy. A commonly criticised point, however, is that the resulting trees may not necessarily be the best representation of the data in terms of accuracy, size, and other considerations such as fairness. In recent years, this motivated the development of optimal classification tree algorithms that globally optimise the decision tree in contrast to heuristic methods that perform a sequence of locally optimal decisions. We follow this line of work and provide a novel algorithm for learning optimal classification trees based on dynamic programming and search. Our algorithm supports constraints on the depth of the tree and number of nodes and we argue it can be extended with other requirements. The success of our approach is attributed to a series of specialised techniques that exploit properties unique to classification trees. Whereas algorithms for optimal classification trees have traditionally been plagued by high runtimes and limited scalability, we show in a detailed experimental study that our approach uses only a fraction of the time required by the state-of-the-art and can handle datasets with tens of thousands of instances, providing several orders of magnitude improvements and notably contributing towards the practical realisation of optimal decision trees.
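To make "globally optimal" concrete, the following minimal sketch exhaustively searches every depth-limited tree over binary features and returns the smallest achievable number of misclassifications. It is a naive recursion written for illustration only, not the MurTree algorithm; the function names and the XOR toy data are assumptions introduced here.

```python
from collections import Counter

def leaf_error(labels):
    """Misclassifications if a leaf predicts the majority class."""
    if not labels:
        return 0
    return len(labels) - max(Counter(labels).values())

def optimal_error(rows, labels, depth):
    """Minimum misclassifications over all trees with at most `depth` split levels."""
    best = leaf_error(labels)  # option 1: stop here and use a leaf
    if depth == 0 or best == 0:
        return best
    for f in range(len(rows[0])):  # option 2: try every binary feature as the root
        left = [i for i, r in enumerate(rows) if r[f] == 0]
        right = [i for i, r in enumerate(rows) if r[f] == 1]
        if not left or not right:
            continue  # the split does not separate anything
        err = sum(
            optimal_error([rows[i] for i in side], [labels[i] for i in side], depth - 1)
            for side in (left, right)
        )
        best = min(best, err)
    return best

# Toy data (assumed): XOR of two binary features.
xor_rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [0, 1, 1, 0]
```

On the XOR data no single split reduces the error, so a search limited to one level cannot improve on a leaf, while the exhaustive search with depth 2 finds a tree with zero errors. The cost of this naive search grows quickly with depth and feature count, which is the scalability problem the abstract refers to.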


Classification with Random Forests in Python

#artificialintelligence

The random forests algorithm is a machine learning method that can be used for supervised learning tasks such as classification and regression. The algorithm works by constructing a set of decision trees, each trained on a bootstrap sample of the data with random subsets of features considered at the splits. In the case of classification, the output of a random forest model is the mode of the predicted classes across the decision trees. In this post, we will discuss how to build random forest models for classification tasks in Python.
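The aggregation step described above, taking the mode of the per-tree predictions, can be sketched in a few lines. The function name and the example votes are assumptions for illustration; this is not the code from the post.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class among the per-tree predictions for one sample."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for a single sample.
votes = ["spam", "ham", "spam", "spam", "ham"]
forest_prediction = majority_vote(votes)
```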


A Nonparametric Test of Dependence Based on Ensemble of Decision Trees

arXiv.org Machine Learning

A general-purpose method to detect statistical dependence, or correlation, between random variables has invaluable uses in a wide array of sciences and applications (Li, 2000; Martínez-Gómez et al., 2014; Mahdi et al., 2012). Linear correlation (Pearson, 1920) is one of the oldest statistical methods still widely used today. Though the assumption of linearity is not always realistic, the popularity of this method stems from its ease of computation, simplicity, interpretability, and high power when the assumption of linearity is satisfied. Several approaches have been proposed to quantify correlation in the general case, for more complex relationships and under less stringent assumptions. Examples of these methods are kernel-based correlation (Hardoon et al., 2004; Chang et al., 2013), copula methods (Poczos et al., 2012), distance correlation (Székely et al., 2007; Székely and Rizzo, 2009), and discretization-based mutual information (MI) (Steuer et al., 2002) methods such as the maximal information coefficient (MIC) (Reshef et al., 2011). Shortcomings of some of the existing methods include low statistical power, high computational demand, lack of intuitive interpretability, or lack of a known distribution of the coefficient under independence that would enable computing statistical confidence. More thorough details on the pros and cons of these methods and others can be found in several studies (de Siqueira Santos et al., 2014; N. Reshef et al., 2018).
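The limitation of linear correlation mentioned above can be seen in a small sketch: Pearson's coefficient is 1 for an exactly linear relationship but 0 for a symmetric quadratic one, even though the second variable is fully determined by the first. The helper name and the toy data are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # exactly linear in x
quadratic = [x * x for x in xs]    # deterministic but non-linear in x

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)
```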


A complete explanation of Random Forest Algorithm.

#artificialintelligence

Ensemble learning is a technique that combines multiple algorithms, of the same or different types, to form a more powerful regression or classification model. The random forest algorithm combines multiple decision trees into a single model. Because of its diversity and simplicity, it is one of the most widely used algorithms. It is used for both classification and regression problems.


How to Develop a Bagging Ensemble with Python

#artificialintelligence

Bagging is an ensemble machine learning algorithm that combines the predictions from many decision trees. It is also easy to implement given that it has few key hyperparameters and sensible heuristics for configuring these hyperparameters. Bagging performs well in general and provides the basis for a whole family of decision tree ensemble algorithms such as the popular random forest and extra trees ensemble algorithms, as well as the lesser-known Pasting, Random Subspaces, and Random Patches ensemble algorithms. In this tutorial, you will discover how to develop Bagging ensembles for classification and regression. Bootstrap Aggregation, or Bagging for short, is an ensemble machine learning algorithm. Specifically, it is an ensemble of decision tree models, although the bagging technique can also be used to combine the predictions of other types of models.
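As a rough sketch of the idea, the code below draws bootstrap samples (sampling with replacement), fits a one-threshold decision stump on each, and combines the stumps by majority vote. The stump learner, the function names, the toy data, and the choice of 25 models are all assumptions made for illustration; the tutorial itself may use a library implementation instead.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_stump(sample):
    """Return the threshold t minimising training errors for the rule: predict 1 if x > t."""
    xs = sorted({x for x, _ in sample})
    thresholds = [xs[0] - 1.0]                                 # predicts 1 everywhere
    thresholds += [(a + b) / 2 for a, b in zip(xs, xs[1:])]    # midpoints
    thresholds += [xs[-1] + 1.0]                               # predicts 0 on the sample
    def errors(t):
        return sum((x > t) != bool(y) for x, y in sample)
    return min(thresholds, key=errors)

def bagged_predict(stumps, x):
    """Majority vote over the stump predictions for input x."""
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
stumps = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
```

Calling `bagged_predict(stumps, 0.0)` and `bagged_predict(stumps, 9.0)` returns the class on each side of the gap in the toy data. An odd number of models avoids tied votes in the two-class case.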


Machine Learning: Decision Trees

#artificialintelligence

This blog covers another interesting machine learning algorithm called Decision Trees and its mathematical implementation. At every point in our lives, we make decisions in order to proceed further. Similarly, this machine learning algorithm makes decisions on the dataset provided, figuring out the best split at each step to improve accuracy. This, in turn, helps in giving valuable results. A decision tree is a machine learning algorithm which represents a hierarchical division of a dataset to form a tree based on certain parameters.
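One common way to "figure out the best split" is to pick the threshold that minimises the weighted Gini impurity of the two resulting groups. The sketch below does this for a single numeric feature. The use of Gini impurity, the function names, and the toy data are assumptions here, since the excerpt does not say which criterion the blog uses.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(xs, ys):
    """Return (threshold, weighted_gini) for the best split of one numeric feature.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (None, gini(ys)) if no split improves on the unsplit node.
    """
    values = sorted(set(xs))
    best = (None, gini(ys))
    for a, b in zip(values, values[1:]):
        t = (a + b) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

threshold, impurity = best_split([1, 2, 3, 4], [0, 0, 1, 1])
```

A full tree learner would apply `best_split` recursively to the left and right groups until a stopping rule is met.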


Machine Learning Basics: Random Forest Regression

#artificialintelligence

Previously, I explained various regression models such as Linear, Polynomial, Support Vector and Decision Tree Regression. In this article, we will go through the code for the application of Random Forest Regression, which is an extension of the Decision Tree Regression implemented previously. The Decision Tree is an easily understood and interpreted algorithm, but a single tree may not be enough for the model to learn the features of the data. Random Forest, on the other hand, is a "Tree"-based algorithm that uses the combined qualities of multiple Decision Trees for making decisions. Therefore, it can be referred to as a 'Forest' of trees, hence the name "Random Forest".
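For regression, the usual way a forest combines its trees is to average their numeric predictions. The sketch below shows only that averaging step, with three hand-written piecewise-constant functions standing in for trained trees; these stand-ins and the function name are hypothetical and are not the article's code.

```python
from statistics import mean

def forest_regress(trees, x):
    """Average the numeric predictions of all trees for input x."""
    return mean(tree(x) for tree in trees)

# Hypothetical stand-ins for trained regression trees (one split each).
trees = [
    lambda x: 10.0 if x < 5 else 20.0,
    lambda x: 12.0 if x < 4 else 22.0,
    lambda x: 8.0 if x < 6 else 18.0,
]

low = forest_regress(trees, 3)   # every tree takes its "small x" branch
high = forest_regress(trees, 7)  # every tree takes its "large x" branch
```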


Pitfalls to Avoid when Interpreting Machine Learning Models

#artificialintelligence

Traditionally, researchers have used parametric models, e.g., linear models, to conduct inference. However, a noticeable shift has happened over the last years towards more non-parametric and non-linear ML models. Practitioners are usually interested in the global effect that features have on the outcome and their importance for correct predictions. For certain model classes, e.g., linear models or decision trees, feature effects or importance scores can be inferred from the learned parameters and model structure. In contrast, complex non-linear models that, e.g., do not have intelligible parameters, make it more difficult to extract such knowledge. Therefore, interpretation methods necessarily simplify the relationships between features and the target, e.g., by marginalizing over other features.
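Marginalizing over other features can be illustrated with a partial-dependence style estimate: fix one feature to a chosen value in every row, keep the other features as observed, and average the model's predictions. The toy model, the data, and the function name below are assumptions for illustration.

```python
def partial_dependence(model, rows, feature, value):
    """Average prediction when `feature` is set to `value` in every row."""
    total = 0.0
    for row in rows:
        modified = list(row)
        modified[feature] = value  # overwrite only the feature of interest
        total += model(modified)
    return total / len(rows)

def model(row):
    """Hypothetical fitted model: 2 * x0 + x1."""
    return 2.0 * row[0] + row[1]

rows = [(0.0, 0.0), (1.0, 2.0), (5.0, 4.0)]
pd_at_1 = partial_dependence(model, rows, feature=0, value=1.0)
```

Because the estimate averages over the observed values of the other features, it can hide interactions and can evaluate the model on unrealistic feature combinations when features are correlated, which is one reason such simplifications need careful interpretation.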


Technologies for Trustworthy Machine Learning: A Survey in a Socio-Technical Context

arXiv.org Artificial Intelligence

Concerns about the societal impact of AI-based services and systems have encouraged governments and other organisations around the world to propose AI policy frameworks to address fairness, accountability, transparency and related topics. To achieve the objectives of these frameworks, the data and software engineers who build machine-learning systems require knowledge about a variety of relevant supporting tools and techniques. In this paper we provide an overview of technologies that support building trustworthy machine learning systems, i.e., systems whose properties justify that people place trust in them. We argue that four categories of system properties are instrumental in achieving the policy objectives, namely fairness, explainability, auditability and safety & security (FEAS). We discuss how these properties need to be considered across all stages of the machine learning life cycle, from data collection through run-time model inference. As a consequence, we survey in this paper the main technologies with respect to all four of the FEAS properties, for data-centric as well as model-centric stages of the machine learning system life cycle. We conclude with an identification of open research problems, with a particular focus on the connection between trustworthy machine learning technologies and their implications for individuals and society.


Using Continuous Machine Learning to Run Your ML Pipeline

#artificialintelligence

CI/CD is a key concept that is becoming increasingly popular and widely adopted in the software industry. Incorporating continuous integration and deployment for a software project that doesn't contain a machine learning component is fairly straightforward because the stages of the pipeline are somewhat standard, and it is unlikely that the CI/CD pipeline will change a lot over the course of development. But when the project involves a machine learning component, this may not be true. As opposed to traditional software development, building a pipeline for a machine learning component may involve a lot of changes over time, mostly in response to observations made during past iterations of development. Therefore, for ML projects, notebooks are widely used to get started with the project, and once a stable foundation (base code for different stages of the ML pipeline) is available to build upon, the code is pushed to a version control system, and the pipeline is migrated to a CI/CD tool such as Jenkins or TravisCI.