 Decision Tree Learning


HDTree: A Customizable and Interactable Decision Tree Written in Python

#artificialintelligence

This story introduces yet another implementation of Decision Trees, which I wrote as part of my thesis. First, I will explain why I decided to take the time to write my own implementation of Decision Trees; I will list some of its features as well as the disadvantages of the current implementation. Second, I will guide you through the basic usage of HDTree with code snippets, explaining some details along the way. Lastly, there will be some hints on how to customize and extend HDTree with your own ideas. However, this article will not guide you through all of the basics of Decision Trees. There are plenty of resources out there [1][2][3][16].


Back to Machine Learning Basics - Decision Tree & Random Forest

#artificialintelligence

For example, suppose a node holds 43 training instances, of which 13 belong to one class and 30 belong to a second class. Given that we have only those two classes in the training dataset, the Gini impurity is $1 - (13/43)^2 - (30/43)^2 \approx 1 - 0.09 - 0.49 = 0.42$. When the node is "pure", its Gini index is 0. Information gain, on the other hand, lets us find the threshold that reduces this impurity the most. To calculate information gain, we compute the average impurity of the nodes produced by the split, weighted by their sizes, and subtract it from the starting impurity. That is how we measure the quality of the thresholds we tried.
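The arithmetic for the node with 13 and 30 instances can be checked with a few lines of Python. This is a minimal sketch: the function names `gini` and `impurity_reduction` and the candidate split `[10, 5]` / `[3, 25]` are made up for illustration and do not come from the article.

```python
def gini(counts):
    """Gini impurity of a node, given the number of instances per class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_reduction(parent, children):
    """Starting impurity minus the size-weighted average impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


parent = [13, 30]                 # the node from the example
print(round(gini(parent), 2))     # 0.42

# Hypothetical split of the same 43 instances into two child nodes.
left, right = [10, 5], [3, 25]
print(impurity_reduction(parent, [left, right]))
```

A split that leaves the class mix of both children unchanged would give a reduction of zero, so larger values indicate better thresholds.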


On $\ell_p$-norm Robustness of Ensemble Stumps and Trees

arXiv.org Machine Learning

Recent papers have demonstrated that ensemble stumps and trees could be vulnerable to small input perturbations, so robustness verification and defense for those models have become an important research problem. However, due to the structure of decision trees, where each node makes a decision purely based on one feature value, all the previous works only consider the $\ell_\infty$ norm perturbation. To study robustness with respect to a general $\ell_p$ norm perturbation, one has to consider the correlation between perturbations on different features, which has not been handled by previous algorithms. In this paper, we study the problem of robustness verification and certified defense with respect to general $\ell_p$ norm perturbations for ensemble decision stumps and trees. For robustness verification of ensemble stumps, we prove that complete verification is NP-complete for $p\in(0, \infty)$ while polynomial time algorithms exist for $p=0$ or $\infty$. For $p\in(0, \infty)$ we develop an efficient dynamic programming based algorithm for sound verification of ensemble stumps. For ensemble trees, we generalize the previous multi-level robustness verification algorithm to $\ell_p$ norm. We demonstrate the first certified defense method for training ensemble stumps and trees with respect to $\ell_p$ norm perturbations, and verify its effectiveness empirically on real datasets.


CHIRPS: Explaining random forest classification

#artificialintelligence

Modern machine learning methods typically produce "black box" models that are opaque to interpretation. Yet demand for them has been increasing in Human-in-the-Loop processes, that is, processes that require a human agent to verify, approve or reason about automated decisions before they can be applied. To facilitate this interpretation, we propose Collection of High Importance Random Path Snippets (CHIRPS), a novel algorithm for explaining random forest classification per data instance. CHIRPS extracts a decision path from each tree in the forest that contributes to the majority classification, and then uses frequent pattern mining to identify the most commonly occurring split conditions. Then a simple, conjunctive form rule is constructed where the antecedent terms are derived from the attributes that had the most influence on the classification.
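The general idea of collecting paths from the majority-voting trees and counting split conditions can be illustrated roughly as follows. This is not the CHIRPS algorithm itself: the paths are hand-written instead of extracted from a trained forest, the split conditions are assumed to be already discretized so that identical conditions can be counted, and plain frequency counting stands in for frequent pattern mining. The name `majority_rule` and all feature names are hypothetical.

```python
from collections import Counter

# One (predicted label, decision path) pair per tree, for a single data instance.
# Each path is a list of split conditions (feature, operator, threshold).
paths = [
    ("yes", [("age", ">", 40), ("income", "<=", 50)]),
    ("yes", [("age", ">", 40), ("tenure", ">", 2)]),
    ("no", [("income", ">", 50)]),
    ("yes", [("income", "<=", 50), ("age", ">", 40)]),
]


def majority_rule(paths, top_k=2):
    """Build a conjunctive rule from the conditions most common among majority-voting trees."""
    votes = Counter(label for label, _ in paths)
    majority = votes.most_common(1)[0][0]
    counts = Counter(
        cond for label, conds in paths if label == majority for cond in conds
    )
    antecedent = [cond for cond, _ in counts.most_common(top_k)]
    return antecedent, majority


antecedent, label = majority_rule(paths)
conditions = " AND ".join(f"{f} {op} {v}" for f, op, v in antecedent)
print(f"IF {conditions} THEN {label}")
```

Only the trees that agree with the forest's majority vote contribute conditions, which mirrors the description above of using paths that contribute to the majority classification.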


Never Ignore these 5 Machine Learning Modeling Challenges

#artificialintelligence

Okay, you have decided to build your own machine learning model. You are using Sklearn, a popular machine learning library for modeling. But wait: do you know the common machine learning modeling challenges faced by every data scientist? No? Then you have come to the right place. Here you will learn about each modeling challenge you face while building a model. When you have a categorical target dataset.


Modeling Text with Decision Forests using Categorical-Set Splits

arXiv.org Machine Learning

Decision forest algorithms model data by learning a binary tree structure recursively where every node splits the feature space into two regions, sending examples into the left or right branches. This "decision" is the result of the evaluation of a condition. For example, a node may split input data by applying a threshold to a numerical feature value. Such decisions are learned using (often greedy) algorithms that attempt to optimize a local loss function. Crucially, whether an algorithm exists to find and evaluate splits for a feature type (e.g., text) determines whether a decision forest algorithm can model that feature type at all. In this work, we set out to devise such an algorithm for textual features, thereby equipping decision forests with the ability to directly model text without the need for feature transformation. Our algorithm is efficient during training and the resulting splits are fast to evaluate with our extension of the QuickScorer inference algorithm. Experiments on benchmark text classification datasets demonstrate the utility and effectiveness of our proposal.
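How a single node routes examples into left and right branches can be sketched in a few lines. The snippet below is an illustration under stated assumptions, not the paper's split-finding algorithm: the `split` helper, the example fields `length` and `tokens`, and the set-intersection condition for text are all hypothetical, and both conditions are fixed by hand, not learned.

```python
def split(examples, condition):
    """Send each example to the right branch if the condition holds, else to the left."""
    left, right = [], []
    for example in examples:
        (right if condition(example) else left).append(example)
    return left, right


examples = [
    {"length": 3, "tokens": {"please", "send", "invoice"}},
    {"length": 7, "tokens": {"great", "product"}},
    {"length": 5, "tokens": {"refund", "requested"}},
]

# A threshold condition on a numerical feature.
left, right = split(examples, lambda ex: ex["length"] >= 5)
print(len(left), len(right))

# An assumed set-valued condition: does the example's token set contain
# any token from a chosen set?
selected = {"refund", "invoice"}
left, right = split(examples, lambda ex: bool(ex["tokens"] & selected))
print(len(left), len(right))
```

In a real decision forest the threshold or the token set would be chosen by the training algorithm to optimize a local loss, as described above.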


Great Machine Learning Project For Beginners – Predict NBA Player Position

#artificialintelligence

So now that we've covered the basics of machine learning with regression models, let's move on to something a little more sophisticated: Decision Trees. What is a decision tree, you ask? A decision tree is a set of questions you can ask to classify different data points. It's called a tree because it has a tree-like shape, just inverted. If you've got the weather forecast for the day, it'd be pretty easy to look at it and determine if you'd want to go play tennis that day.
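The "set of questions" view can be written down directly as a nested structure. The tree below is a hand-made sketch for the tennis example; the questions, the answers and the names `TREE` and `classify` are assumptions for illustration, not rules learned from data.

```python
# Inner nodes ask a question about the forecast; leaves are the final answer.
TREE = {
    "question": "outlook",
    "branches": {
        "sunny": {
            "question": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": {
            "question": "windy",
            "branches": {True: "no", False: "yes"},
        },
    },
}


def classify(node, forecast):
    """Follow the answers down the tree until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        answer = forecast[node["question"]]
        node = node["branches"][answer]
    return node


today = {"outlook": "sunny", "humidity": "normal", "windy": False}
print(classify(TREE, today))  # yes
```

Training a decision tree amounts to choosing these questions automatically from labeled examples.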


Ensemble Forecasting of the Zika Space-Time Spread with Topological Data Analysis

arXiv.org Machine Learning

As per the records of the World Health Organization, the first formally reported incidence of Zika virus occurred in Brazil in May 2015. The disease then rapidly spread to other countries in the Americas and East Asia, affecting more than 1,000,000 people. Zika virus is primarily transmitted through bites of infected mosquitoes of the species Aedes (Aedes aegypti and Aedes albopictus). The abundance of mosquitoes and, as a result, the prevalence of Zika virus infections are common in areas which have high precipitation, high temperature, and high population density. Nonlinear spatio-temporal dependency of such data and lack of historical public health records make prediction of the virus spread particularly challenging. In this article, we enhance Zika forecasting by introducing the concepts of topological data analysis and, specifically, persistent homology of atmospheric variables, into the virus spread modeling. The topological summaries allow for capturing higher order dependencies among atmospheric variables that otherwise might be unassessable via conventional spatio-temporal modeling approaches based on geographical proximity assessed via Euclidean distance. We introduce a new concept of cumulative Betti numbers and then integrate the cumulative Betti numbers as topological descriptors into three predictive machine learning models: random forest, generalized boosted regression, and deep neural network. Furthermore, to better quantify various sources of uncertainty, we combine the resulting individual model forecasts into an ensemble of the Zika spread predictions using Bayesian model averaging. The proposed methodology is illustrated in application to forecasting of the Zika space-time spread in Brazil in the year 2018.


From industry-wide parameters to aircraft-centric on-flight inference: improving aeronautics performance prediction with machine learning

arXiv.org Machine Learning

Aircraft performance models play a key role in airline operations, especially in planning a fuel-efficient flight. In practice, manufacturers provide guidelines which are slightly modified throughout the aircraft life cycle via the tuning of a single factor, enabling better fuel predictions. However, this approach has limitations; in particular, it does not reflect the evolution of each feature impacting the aircraft performance. Our goal here is to overcome this limitation. The key contribution of the present article is to foster the use of machine learning to leverage the massive amounts of data continuously recorded during flights performed by an aircraft and provide models reflecting its actual and individual performance. We illustrate our approach by focusing on the estimation of the drag and lift coefficients from recorded flight data. As these coefficients are not directly recorded, we resort to aerodynamics approximations. As a safety check, we provide bounds to assess the accuracy of both the aerodynamics approximation and the statistical performance of our approach. We provide numerical results on a collection of machine learning algorithms. We report excellent accuracy on real-life data and exhibit empirical evidence to support our modelling, in coherence with aerodynamics principles.


neomatrix369/awesome-ai-ml-dl

#artificialintelligence

Contributions are very welcome; please share back with the wider community (and get credited for it)! Please have a look at the CONTRIBUTING guidelines, and also read about our licensing policy.