 Ensemble Learning



Fair Adversarial Gradient Tree Boosting

arXiv.org Artificial Intelligence

Fair classification has become an important topic in machine learning research. While most bias mitigation strategies focus on neural networks, we noticed a lack of work on fair classifiers based on decision trees even though they have proven very efficient. In an up-to-date comparison of state-of-the-art classification algorithms in tabular data, tree boosting outperforms deep learning [1]. For this reason, we have developed a novel approach of adversarial gradient tree boosting. The objective of the algorithm is to predict the output Y with gradient tree boosting while minimizing the ability of an adversarial neural network to predict the sensitive attribute S. The approach incorporates at each iteration the gradient of the neural network directly in the gradient tree boosting. We empirically assess our approach on 4 popular data sets and compare against state-of-the-art algorithms. The results show that our algorithm achieves a higher accuracy while obtaining the same level of fairness, as measured using a set of different common fairness definitions.

INTRODUCTION. Machine learning models are increasingly used in decision making processes. In many fields of application, they generally deliver superior performance compared with conventional, deterministic algorithms. However, those models are mostly black boxes which are hard, if not impossible, to interpret.
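The abstract's idea of feeding the adversary's gradient into each boosting step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes binary Y and S, a logistic loss for both predictor and adversary, and a one-neuron logistic adversary with hypothetical parameters `w` and `b` standing in for the adversarial neural network. The function name `fair_pseudo_residuals` and the trade-off weight `lam` are also assumptions. The returned values are the targets the next regression tree would be fit to.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fair_pseudo_residuals(scores, y, s, w, b, lam):
    """Negative gradient of L_Y - lam * L_S with respect to the raw scores F.

    scores: current boosted scores F(x_i)
    y, s:   binary labels and binary sensitive attributes
    w, b:   parameters of a one-neuron adversary a = sigmoid(w * p + b) (assumed)
    lam:    weight of the fairness term (assumed hyperparameter)
    """
    residuals = []
    for f, yi, si in zip(scores, y, s):
        p = sigmoid(f)                          # predictor output
        a = sigmoid(w * p + b)                  # adversary's guess for S
        grad_pred = p - yi                      # dL_Y/dF for the logistic loss
        grad_adv = (a - si) * w * p * (1 - p)   # dL_S/dF by the chain rule
        residuals.append(-(grad_pred - lam * grad_adv))
    return residuals
```

With `lam = 0` the targets reduce to the usual gradient boosting residuals `y - p`; a larger `lam` pushes the scores in the direction that makes the adversary's prediction of S worse.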


Purifying Interaction Effects with the Functional ANOVA: An Efficient Algorithm for Recovering Identifiable Additive Models

arXiv.org Artificial Intelligence

Recent methods for training generalized additive models (GAMs) with pairwise interactions achieve state-of-the-art accuracy on a variety of datasets. Adding interactions to GAMs, however, introduces an identifiability problem: effects can be freely moved between main effects and interaction effects without changing the model predictions. In some cases, this can lead to contradictory interpretations of the same underlying function. This is a critical problem because a central motivation of GAMs is model interpretability. In this paper, we use the Functional ANOVA decomposition to uniquely define interaction effects and thus produce identifiable additive models with purified interactions. To compute this decomposition, we present a fast, exact, mass-moving algorithm that transforms any piecewise-constant function (such as a tree-based model) into a purified, canonical representation. We apply this algorithm to several datasets and show large disparity, including contradictions, between the apparent and the purified effects. An important question in data analysis is whether two variables act in concert to affect an outcome. But this unconstrained additive model has fundamental flaws.
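To make the mass-moving idea concrete, here is a small sketch for a single pairwise interaction stored as a table of piecewise-constant values. It is an illustration under stated assumptions rather than the paper's algorithm: the function name `purify`, the strictly positive weight matrix `weights` (standing in for the data density over the cells), and the stopping tolerance are all hypothetical. The loop repeatedly moves weighted row and column means out of the interaction table into the main effects, then centers the main effects into an intercept.

```python
def purify(table, weights, tol=1e-12, max_iter=1000):
    """Split a 2-D table into intercept + row effect + column effect + pure interaction."""
    n_rows, n_cols = len(table), len(table[0])
    inter = [row[:] for row in table]   # becomes the purified interaction
    main_rows = [0.0] * n_rows
    main_cols = [0.0] * n_cols

    for _ in range(max_iter):
        moved = 0.0
        # Move each row's weighted mean into the row main effect.
        for i in range(n_rows):
            w_sum = sum(weights[i])
            mean = sum(weights[i][j] * inter[i][j] for j in range(n_cols)) / w_sum
            main_rows[i] += mean
            for j in range(n_cols):
                inter[i][j] -= mean
            moved = max(moved, abs(mean))
        # Move each column's weighted mean into the column main effect.
        for j in range(n_cols):
            w_sum = sum(weights[i][j] for i in range(n_rows))
            mean = sum(weights[i][j] * inter[i][j] for i in range(n_rows)) / w_sum
            main_cols[j] += mean
            for i in range(n_rows):
                inter[i][j] -= mean
            moved = max(moved, abs(mean))
        if moved < tol:
            break

    # Center the main effects and collect their means in the intercept.
    total = sum(sum(row) for row in weights)
    row_mean = sum(sum(weights[i]) / total * main_rows[i] for i in range(n_rows))
    col_mean = sum(
        sum(weights[i][j] for i in range(n_rows)) / total * main_cols[j]
        for j in range(n_cols)
    )
    intercept = row_mean + col_mean
    main_rows = [m - row_mean for m in main_rows]
    main_cols = [m - col_mean for m in main_cols]
    return intercept, main_rows, main_cols, inter


# Example: the logical AND of two binary features with uniform weights.
intercept, rows, cols, inter = purify([[0.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
```

The predictions are unchanged: for every cell, `intercept + rows[i] + cols[j] + inter[i][j]` equals the original table entry. What changes is that the interaction table now has zero weighted mean along every row and column, so none of its mass could be reassigned to a main effect.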


Machine learning identifies patients in need of end-of-life planning

#artificialintelligence

Penn Medicine researchers have developed a machine learning algorithm that identifies oncology patients at risk of short-term mortality who need end-of-life conversations with clinicians. In a study of 26,525 patients receiving outpatient oncology care, the algorithm accurately predicted patients with cancer who were at risk of six-month mortality using electronic health records, including whether a patient had high blood pressure as well as laboratory and electrocardiogram data. The study found that 51 percent of the patients the algorithm identified as "high priority" for end-of-life conversations died within six months vs. fewer than 4 percent in the "lower priority" group. "Our findings suggest that ML tools hold promise for integration into clinical workflows to ensure that patients with cancer have timely conversations about their goals and values," concludes the study, which was published in the journal JAMA Network Open. Initially, researchers developed, validated and compared three ML models--gradient boosting, logistic regression and random forest--to estimate six-month mortality among patients seen in oncology clinics affiliated with a large academic cancer center. However, the random forest model in the study demonstrated the best predictive results.


Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success

#artificialintelligence

The sustained success of random forests has led naturally to the desire to better understand the statistical and mathematical properties of the procedure. Lin and Jeon (2006) introduced the potential nearest neighbor framework and Biau and Devroye (2010) later established related consistency properties. In the last several years, a number of important statistical properties of random forests have also been established whenever base learners are constructed with subsamples rather than bootstrap samples. Scornet et al. (2015) provided the first consistency result for Breiman's original random forest algorithm whenever the true underlying regression function is assumed to be additive. Despite the impressive volume of research from the past two decades and the exciting recent progress in establishing their statistical properties, a satisfying explanation for the sustained empirical success of random forests has yet to be provided.


Privacy-Preserving Gradient Boosting Decision Trees

arXiv.org Machine Learning

The Gradient Boosting Decision Tree (GBDT) has been a popular machine learning model for various tasks in recent years. In this paper, we study how to improve the model accuracy of GBDT while preserving the strong guarantee of differential privacy. Sensitivity and privacy budget are two key design aspects for the effectiveness of differentially private models. Existing solutions for GBDT with differential privacy suffer from significant accuracy loss due to overly loose sensitivity bounds and ineffective privacy budget allocations (especially across different trees in the GBDT model). Loose sensitivity bounds require more noise to reach a fixed privacy level. Ineffective privacy budget allocations worsen the accuracy loss, especially when the number of trees is large. Therefore, we propose a new GBDT training algorithm that achieves tighter sensitivity bounds and more effective noise allocations. Specifically, by investigating the property of gradient and the contribution of each tree in GBDTs, we propose to adaptively control the gradients of training data for each iteration and leaf node clipping in order to tighten the sensitivity bounds. Furthermore, we design a novel boosting framework to allocate the privacy budget between trees so that the accuracy loss can be reduced. Our experiments show that our approach can achieve much better model accuracy than other baselines.
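The link between sensitivity bounds and noise can be illustrated with a generic Laplace mechanism applied to a sum of clipped gradients. This is a sketch under assumptions, not the paper's training algorithm: the helper names, the clipping bound `bound`, and the choice of add-or-remove-one-record neighboring datasets (which makes the sensitivity of the clipped sum equal to `bound`) are all illustrative.

```python
import random


def clip(values, bound):
    """Clip every value into [-bound, bound]."""
    return [max(-bound, min(bound, v)) for v in values]


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_gradient_sum(gradients, bound, epsilon, rng):
    """Release the sum of clipped gradients with epsilon-differential privacy.

    Assumes neighboring datasets differ by adding or removing one record,
    so one record changes the clipped sum by at most `bound`.
    """
    clipped = clip(gradients, bound)
    sensitivity = bound
    return sum(clipped) + laplace(sensitivity / epsilon, rng)


rng = random.Random(0)
release = noisy_gradient_sum([-3.0, 0.5, 2.0], bound=1.0, epsilon=0.5, rng=rng)
```

The noise scale is `sensitivity / epsilon`, so halving the clipping bound halves the noise needed for the same privacy level. That is why tighter sensitivity bounds translate directly into better accuracy.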


Simplifying Random Forests: On the Trade-off between Interpretability and Accuracy

arXiv.org Machine Learning

We analyze the trade-off between model complexity and accuracy for random forests by breaking the trees up into individual classification rules and selecting a subset of them. We show experimentally that even a few rules are sufficient to achieve an acceptable accuracy close to that of the original model. Moreover, our results indicate that in many cases, this can lead to simpler models that clearly outperform the original ones.


Practical Federated Gradient Boosting Decision Trees

arXiv.org Machine Learning

Gradient Boosting Decision Trees (GBDTs) have become very successful in recent years, with many awards in machine learning and data mining competitions. There have been several recent studies on how to train GBDTs in the federated learning setting. In this paper, we focus on horizontal federated learning, where data samples with the same features are distributed among multiple parties. However, existing studies are not efficient or effective enough for practical use. They suffer either from the inefficiency due to the usage of costly data transformations such as secure sharing and homomorphic encryption, or from the low model accuracy due to differential privacy designs. In this paper, we study a practical federated environment with relaxed privacy constraints. In this environment, a dishonest party might obtain some information about the other parties' data, but it is still impossible for the dishonest party to derive the actual raw data of other parties. Specifically, each party boosts a number of trees by exploiting similarity information based on locality-sensitive hashing. We prove that our framework is secure without exposing the original record to other parties, while the computation overhead in the training process is kept low. Our experimental studies show that, compared with normal training with the local data of each owner, our approach can significantly improve the predictive accuracy, and achieve comparable accuracy to the original GBDT with the data from all parties.
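As a rough illustration of how locality-sensitive hashing can expose similarity without sharing raw records, the sketch below uses sign-random-projection hashing. This is one common LSH family chosen for brevity, and the paper may rely on a different one. The function names and the shared `seed` are assumptions. Each party would hash its own records with the same hyperplanes and exchange only the resulting bit signatures.

```python
import random


def make_hyperplanes(n_bits, dim, seed):
    """Random Gaussian hyperplanes; every party must use the same seed (assumed setup)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def hamming(a, b):
    """Number of differing bits; a small distance suggests similar records."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(n_bits=16, dim=3, seed=42)
sig_a = signature([1.0, 2.0, 3.0], planes)
sig_b = signature([2.0, 4.0, 6.0], planes)     # same direction as the first record
sig_c = signature([-1.0, -2.0, -3.0], planes)  # opposite direction
```

Records pointing in the same direction get identical signatures and opposite records disagree on the bits, so the Hamming distance between signatures serves as a proxy for similarity. The signature is a lossy summary, which fits the relaxed setting described above, where a party may learn something about other parties' data but not the raw records.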


A Comprehensive Guide to Random Forest in R

#artificialintelligence

Classification is the task of predicting the class of a given input data point. Classification problems are common in machine learning, and they fall under supervised learning.


Why You Should Build XGBoost Models Within H2O - Sefik Ilkin Serengil

#artificialintelligence

XGBoost triggered the rise of tree-based models in the machine learning world. It has earned its reputation with robust models, which mostly achieve almost 2% higher accuracy. On the other hand, XGBoost is almost 10 times slower than LightGBM. Speed means a lot in a data challenge.