Decision Tree Learning
A discriminative approach for finding and characterizing positivity violations using decision trees
Karavani, Ehud, Bak, Peter, Shimoni, Yishai
The assumption of positivity in causal inference (also known as common support and covariate overlap) is necessary to obtain valid causal estimates. Therefore, confirming it holds in a given dataset is an important first step of any causal analysis. Most common methods to date are insufficient for discovering non-positivity, as they do not scale for modern high-dimensional covariate spaces, or they cannot pinpoint the subpopulation violating positivity. To overcome these issues, we suggest harnessing decision trees for detecting violations. By dividing the covariate space into mutually exclusive regions, each with maximized homogeneity of treatment groups, decision trees can be used to automatically detect subspaces violating positivity. By augmenting the method with an additional random forest model, we can quantify the robustness of the violation within each subspace. This solution is scalable and provides an interpretable characterization of the subspaces in which violations occur. We provide a visualization of the stratification rules that define each subpopulation, combined with the severity of positivity violation within it. We also provide an interactive version of the visualization that allows a deeper dive into the properties of each subspace.
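The core idea of the abstract above (partition the covariate space by treatment homogeneity, then inspect each region) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the tiny Gini-based tree, the toy data, the `min_leaf` and `max_depth` settings, and the 0.1/0.9 flagging thresholds are all assumptions made for the example, and the random forest robustness step is omitted.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 treatment labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, t, min_leaf):
    """Return the (feature, threshold) that most reduces weighted Gini, or None."""
    n, best, best_score = len(t), None, gini(t)
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [t[i] for i in range(n) if X[i][j] <= thr]
            right = [t[i] for i in range(n) if X[i][j] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (j, thr), score
    return best

def find_leaves(X, t, rules=(), depth=0, max_depth=3, min_leaf=5):
    """Recursively partition the covariates; yield (rules, size, treated fraction)."""
    split = best_split(X, t, min_leaf) if depth < max_depth else None
    if split is None:
        yield rules, len(t), sum(t) / len(t)
        return
    j, thr = split
    for op, keep in (("<=", lambda v: v <= thr), (">", lambda v: v > thr)):
        idx = [i for i in range(len(t)) if keep(X[i][j])]
        yield from find_leaves([X[i] for i in idx], [t[i] for i in idx],
                               rules + ((j, op, thr),), depth + 1, max_depth, min_leaf)

# Toy data (assumption): covariates are [age, sex]; nobody aged 60+ is treated.
X, t = [], []
for age in range(20, 80):
    for k in range(4):
        X.append([age, k % 2])
        t.append(0 if age >= 60 else (1 if k < 2 else 0))

leaves = list(find_leaves(X, t))
# Flag leaves where almost everyone (or almost no one) is treated.
violations = [leaf for leaf in leaves if leaf[2] < 0.1 or leaf[2] > 0.9]
for rules, size, frac in violations:
    print(rules, size, frac)
```

Each flagged leaf carries the stratification rules that define it, which is what makes the characterization interpretable: here the only flagged region is the one where covariate 0 (age) exceeds 59.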
XGBoost and Random Forest with Bayesian Optimisation
Rather than only comparing XGBoost and Random Forest, in this post we will try to explain how to use these two very popular approaches with Bayesian Optimisation and what the main pros and cons of each model are. XGBoost (XGB) and Random Forest (RF) are both ensemble learning methods that predict (classification or regression) by combining the outputs of individual decision trees (we assume tree-based XGB or RF). XGBoost builds decision trees one at a time. Each new tree corrects the errors made by the previously trained trees. At Addepto we use XGBoost models to solve anomaly detection problems, e.g. in a supervised learning approach.
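The sequential, error-correcting behaviour described above can be illustrated with a bare-bones boosting loop. This is a simplified sketch of plain gradient boosting with squared loss and one-split stumps, not XGBoost's actual algorithm (which adds regularization and second-order gradient information); the function names, toy data, and hyperparameters are assumptions for the example.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals; return (thr, left_mean, right_mean)."""
    best = None
    for thr in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]

def predict_stump(stump, xi):
    """Evaluate a fitted stump at a single point."""
    thr, lm, rm = stump
    return lm if xi <= thr else rm

def boost(x, y, n_rounds=20, learning_rate=0.5):
    """Build stumps one at a time; each is fitted to the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        # Residuals are the errors left by all previously trained stumps.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y))
    return base, stumps, history

# Toy regression problem (assumption): y = x^2 on a small grid.
x = [i / 10 for i in range(50)]
y = [xi ** 2 for xi in x]
base, stumps, mse_history = boost(x, y)
print(mse_history[0], mse_history[-1])
```

A Random Forest, by contrast, trains its trees independently on bootstrap samples and averages them, so no tree depends on the errors of another.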
Top Machine Learning and Data Science Methods Used at Work
The practice of data science requires the use of algorithms and data science methods to help data professionals extract insights and value from data. A recent survey by Kaggle revealed that data professionals used data visualization, logistic regression, cross-validation and decision trees more than other data science methods in 2017. Looking ahead to 2018, data professionals are most interested in learning deep learning (41%). Kaggle conducted a survey in August 2017 of over 16,000 data professionals (2017 State of Data Science and Machine Learning). Their survey included a variety of questions about data science, machine learning, education and more.
Contrastive Explanations for Large Errors in Retail Forecasting Predictions through Monte Carlo Simulations
Lucic, Ana, Haned, Hinda, de Rijke, Maarten
At Ahold Delhaize, there is an interest in using more complex machine learning techniques for sales forecasting. It is difficult to convince analysts, along with their superiors, to adopt these techniques since the models are considered to be 'black boxes,' even if they perform better than current models in use. We aim to explore the impact of contrastive explanations about large errors on users' attitudes towards a 'black-box' model. In this work, we make two contributions. The first is an algorithm, Monte Carlo Bounds for Reasonable Predictions (MC-BRP). Given a large error, MC-BRP determines (1) feature values that would result in a reasonable prediction, and (2) general trends between each feature and the target, based on Monte Carlo simulations. The second contribution is the evaluation of MC-BRP along with its outcomes, which has both objective and subjective components. We evaluate on a real dataset with real users from Ahold Delhaize by conducting a user study to determine if explanations generated by MC-BRP help users understand why a prediction results in a large error, and if this promotes trust in an automatically-learned model. The study shows that users are able to answer objective questions about the model's predictions with overall 81.7% accuracy when provided with these contrastive explanations. We also show that users who saw MC-BRP explanations understand why the model makes large errors in predictions significantly more than users in the control group.
Online Local Boosting: improving performance in online decision trees
da Costa, Victor G. Turrisi, Mastelini, Saulo Martiello, de Carvalho, André C. Ponce de Leon Ferreira, Barbon, Sylvio Jr
As more data are produced each day, and faster, data stream mining is growing in importance, making clear the need for algorithms able to process these data quickly. Data stream mining algorithms are meant to be solutions for extracting knowledge online, specially tailored to continuous data problems. Many of the current algorithms for data stream mining have high processing and memory costs. Often, the higher the predictive performance, the higher these costs. To increase predictive performance without largely increasing memory and time costs, this paper introduces a novel algorithm, named Online Local Boosting (OLBoost), which can be combined with online decision tree algorithms to improve their predictive performance without modifying the structure of the induced decision trees. To do so, OLBoost applies boosting to small separate regions of the instance space. Experimental results presented in this paper show that by using OLBoost the online learning decision tree algorithms can significantly improve their predictive performance. Additionally, it can make smaller trees perform as well as or better than larger trees.
The Quantum Version Of Classification Decision Tree Constructing Algorithm C5.0
Khadiev, Kamil, Mannapov, Ilnaz, Safina, Liliya
In this paper, we focus on the complexity of the C5.0 algorithm for constructing a decision tree classifier, a model for the classification problem in machine learning. In the classical case, the decision tree is constructed in $O(hd(NM+N \log N))$ running time, where $M$ is the number of classes, $N$ is the size of the training data set, $d$ is the number of attributes of each element, and $h$ is the tree height. First, we improve the classical version; the running time of the new version is $O(h\cdot d\cdot N\log N)$. Second, we suggest a quantum version of this algorithm, which uses quantum subroutines such as amplitude amplification and the Dürr-Høyer minimum search algorithm, which are based on Grover's algorithm. The running time of the quantum algorithm is $O\big(h\cdot \sqrt{d}\log d \cdot N \log N\big)$, which is better than the complexity of the classical algorithm.
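To make the comparison concrete, the ratio of the two stated bounds can be worked out directly. This is only a back-of-the-envelope reading of the asymptotic expressions above: hidden constants are ignored, and the base-2 logarithm in the numeric example is an assumption.

$$
\frac{h\cdot d\cdot N\log N}{h\cdot \sqrt{d}\,\log d\cdot N\log N} \;=\; \frac{\sqrt{d}}{\log d}
$$

The factors $h$ and $N\log N$ cancel, so the gain depends only on the number of attributes $d$. For example, with $d = 1024$ and logarithms taken base 2, the ratio is $32/10 = 3.2$, and it grows without bound as $d$ increases.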
FAHT: An Adaptive Fairness-aware Decision Tree Classifier
Zhang, Wenbin, Ntoutsi, Eirini
Automated data-driven decision-making systems are ubiquitous across a wide range of online as well as offline services. These systems depend on sophisticated learning algorithms and available data to optimize the service function for decision support assistance. However, there is a growing concern about the accountability and fairness of the employed models, because the available historic data is often intrinsically discriminatory, i.e., the proportion of members sharing one or more sensitive attributes is higher than the proportion in the population as a whole when receiving positive classification, which leads to a lack of fairness in decision support systems. A number of fairness-aware learning methods have been proposed to handle this concern. However, these methods tackle fairness as a static problem and do not take the evolution of the underlying stream population into consideration. In this paper, we introduce a learning mechanism to design a fair classifier for online stream-based decision-making. Our learning model, FAHT (Fairness-Aware Hoeffding Tree), is an extension of the well-known Hoeffding Tree algorithm for decision tree induction over streams that also accounts for fairness. Our experiments show that our algorithm is able to deal with discrimination in streaming environments, while maintaining a moderate predictive performance over the stream.
What's wrong with the approach to Data Science?
Data science is the application of statistics, programming and domain knowledge to generate insights into a problem that needs to be solved. The Harvard Business Review said Data Scientist is the sexiest job of the 21st century. How often has that article been referenced to convince people? The job 'Data Scientist' has been around for decades; it was just not called "Data Scientist". Statisticians have applied their knowledge and skills, using machine learning techniques such as Logistic Regression and Random Forest, for prediction and insights for decades.
Best Split Nodes for Regression Trees
Decision trees with binary splits are popularly constructed using Classification and Regression Trees (CART) methodology. For regression models, this approach recursively divides the data into two near-homogenous daughter nodes according to a split point that maximizes the reduction in sum of squares error (the impurity) along a particular variable. This paper aims to study the bias and adaptive properties of regression trees constructed with CART. In doing so, we derive an interesting connection between the bias and the mean decrease in impurity (MDI) measure of variable importance---a tool widely used for model interpretability---defined as the sum of impurity reductions over all non-terminal nodes in the tree. In particular, we show that the size of a terminal subnode for a variable is small when the MDI for that variable is large and that this relationship is exponential---confirming theoretically that decision trees with CART have small bias and are adaptive to signal strength and direction. Finally, we apply these individual tree bounds to tree ensembles and show consistency of Breiman's random forests. The context is surprisingly general and applies to a wide variety of multivariable data generating distributions and regression functions. The main technical tool is an exact characterization of the conditional probability content of the daughter nodes arising from an optimal split, in terms of the partial dependence function and reduction in impurity.
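The CART split rule described above (choose the split point along a variable that maximizes the reduction in sum of squares error) can be sketched as follows. This is a minimal single-variable illustration; using the unnormalized sum of squares as the impurity, and the function and variable names, are assumptions for the example rather than the paper's exact definitions.

```python
def sse(values):
    """Sum of squared deviations from the mean (the node impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_regression_split(x, y):
    """Return (threshold, impurity reduction) for the best split along x."""
    parent = sse(y)
    best_thr, best_gain = None, 0.0
    for thr in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= thr]
        right = [yi for xi, yi in zip(x, y) if xi > thr]
        # Reduction in impurity: parent SSE minus the daughters' combined SSE.
        gain = parent - (sse(left) + sse(right))
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain

# Toy data (assumption): a step function with a jump between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
threshold, reduction = best_regression_split(x, y)
print(threshold, reduction)
```

Summing these reductions over every non-terminal node that splits on a given variable gives that variable's MDI, as defined in the abstract.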
Heart of Darkness: Logistic Regression vs. Random Forest
The 'functional needs repair' category of the target variable only makes up about 7% of the whole set. The implication is that whatever algorithm you end up using, it's probably going to learn the other two balanced classes a lot better than this one. Such is data science: the struggle is real. The first thing we're going to do is create an 'age' variable for the waterpoints, as that seems highly relevant. The 'population' variable also has a highly right-skewed distribution, so we're going to change that as well. The zeros inside 'amount_tsh' are also probably NaNs, so we're going to do something drastic and simplify it into 0s and 1s. One of the most important points we learned from the week before, and something that will stay with me, is the idea of coming up with a baseline model as fast as one can.
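The three feature transformations mentioned above might look roughly like the sketch below. The excerpt does not show the original code, so several details are assumptions: the `construction_year` column and the `reference_year` argument are hypothetical, and a `log1p` transform is only one common way to tame a right-skewed variable such as 'population'.

```python
import math

def engineer(row, reference_year):
    """Return a copy of one waterpoint record with three derived features."""
    out = dict(row)
    # 'age': years since construction ('construction_year' is a hypothetical column).
    year = row.get("construction_year")
    out["age"] = reference_year - year if year else None
    # Compress the right-skewed 'population' (log1p is an assumed choice).
    out["log_population"] = math.log1p(row["population"])
    # Simplify 'amount_tsh' into 0/1: zero is treated as "no recorded amount".
    out["has_amount_tsh"] = 1 if row["amount_tsh"] > 0 else 0
    return out

# Illustrative record (all values are made up for the example).
example = engineer(
    {"construction_year": 1995, "population": 250, "amount_tsh": 0.0},
    reference_year=2013,
)
print(example)
```

Keeping the transformations this small fits the baseline-first idea in the text: get a simple model running quickly, then refine the features.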