

 Decision Tree Learning


Evidence-Based Policy Learning

arXiv.org Machine Learning

The past years have seen the development and deployment of machine-learning algorithms to estimate personalized treatment-assignment policies from randomized controlled trials. Yet such algorithms for the assignment of treatment typically optimize expected outcomes without taking into account that treatment assignments are frequently subject to hypothesis testing. In this article, we explicitly take significance testing of the effect of treatment-assignment policies into account, and consider assignments that optimize the probability of finding a subset of individuals with a statistically significant positive treatment effect. We provide an efficient implementation using decision trees, and demonstrate its gain over selecting subsets based on positive (estimated) treatment effects. Compared to standard tree-based regression and classification tools, this approach tends to yield substantially higher power in detecting subgroups with positive treatment effects.

Introduction. Recent years have seen the development of machine-learning algorithms that estimate heterogeneous causal effects from randomized controlled trials. While the estimation of average effects (for example, how effective a vaccine is overall, whether a conditional cash transfer reduces poverty, or which ad leads to more clicks) can inform the decision whether to deploy a treatment or not, heterogeneous treatment effect estimation allows us to decide who should get treated. These algorithms aim to maximize realized outcomes, and thus focus on assigning treatment to individuals with positive (estimated) treatment effects. Yet in practice, the deployment of assignment policies often only happens after passing a test that the assignment produces a positive net effect relative to some status quo. For example, a drug manufacturer may have to demonstrate that the drug is effective on the target population by submitting a hypothesis test to the FDA for approval.


Interpretable Data-driven Methods for Subgrid-scale Closure in LES for Transcritical LOX/GCH4 Combustion

arXiv.org Machine Learning

Many practical combustion systems such as those in rockets, gas turbines, and internal combustion engines operate under high pressures that surpass the thermodynamic critical limit of fuel-oxidizer mixtures. These conditions require the consideration of complex fluid behaviors that pose challenges for numerical simulations, casting doubts on the validity of existing subgrid-scale (SGS) models in large-eddy simulations of these systems. While data-driven methods have shown high accuracy as closure models in simulations of turbulent flames, these models are often criticized for lack of physical interpretability, wherein they provide answers but no insight into their underlying rationale. The objective of this study is to assess SGS stress models from conventional physics-driven approaches and an interpretable machine learning algorithm, i.e., the random forest regressor, in a turbulent transcritical non-premixed flame. To this end, direct numerical simulations (DNS) of transcritical liquid-oxygen/gaseous-methane (LOX/GCH4) inert and reacting flows are performed. Using this data, a priori analysis is performed on the Favre-filtered DNS data to examine the accuracy of physics-based and random forest SGS-models under these conditions. SGS stresses calculated with the gradient model show good agreement with the exact terms extracted from filtered DNS. The accuracy of the random-forest regressor decreases when physics-based constraints are applied to the feature set. Results demonstrate that random forests can perform as effectively as algebraic models when modeling subgrid stresses, but only when trained on a sufficiently representative database. The employment of random forest feature importance score is shown to provide insight into discovering subgrid-scale stresses through sparse regression.


Optimal Targeting in Fundraising: A Machine Learning Approach

arXiv.org Machine Learning

Fundraising is a costly activity: the largest 25 US charities spend between 5% and 25% of total donations on fundraising expenses (Andreoni and Payne, 2011). These numbers are a matter of concern for two reasons. First, high fundraising costs imply that a smaller proportion of overall donations can finance charitable projects. This effect can lead to an underprovision of the provided goods and services and may, thus, lower welfare if the donors' utility depends on provision levels (Rose-Ackerman, 1982; Name-Correa and Yildirim, 2013). Second, high fundraising costs also matter from the charities' perspectives: it is well documented that donors are averse to financing overhead costs (Tinkelman and Mankaney, 2007; Gneezy et al., 2014). Hence, charities with excessive fundraising expenses will be less successful in raising donations. In conclusion, reducing disproportionate fundraising costs can be crucial, both from a welfare and a charity-management perspective. However, while there is a broad literature studying how fundraising instruments such as matching grants and unconditional gifts affect donors' behavior (surveyed by Andreoni and Payne, 2013), previous research has paid less attention to how charities could increase the cost efficacy of fundraising. This paper shifts focus to a novel approach to increase a fundraising campaign's efficacy: optimal targeting of fundraising activities based on causal machine learning.


Interpretable Machines: Constructing Valid Prediction Intervals with Random Forests

arXiv.org Machine Learning

An important issue when using Machine Learning algorithms in recent research is the lack of interpretability. Although these algorithms provide accurate point predictions for various learning problems, uncertainty estimates connected with point predictions are rather sparse. A contribution to this gap for the Random Forest Regression Learner is presented here. Based on its Out-of-Bag procedure, several parametric and non-parametric prediction intervals are provided for Random Forest point predictions, and theoretical guarantees for their correct coverage probability are delivered. In a second part, a thorough investigation through Monte-Carlo simulation is conducted, evaluating the performance of the proposed methods from three aspects: (i) analyzing the correct coverage rate of the proposed prediction intervals, (ii) inspecting interval width, and (iii) verifying the competitiveness of the proposed intervals with existing methods. The simulation shows that the proposed prediction intervals are robust towards non-normal residual distributions and are competitive by providing correct coverage rates and comparably narrow interval lengths, even for comparably small samples.
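
To make the idea of an Out-of-Bag-based interval concrete, here is a minimal Python sketch of one plausible non-parametric construction: shift the point prediction by empirical quantiles of the out-of-bag residuals. This is an illustration under stated assumptions, not necessarily any of the intervals proposed in the paper. The function names, the quantile rule, and the residual values are all hypothetical.

```python
import math


def empirical_quantile(values, q):
    """Return a simple order-statistic estimate of the q-quantile."""
    ordered = sorted(values)
    # Clamp the index so that q = 0 and q = 1 stay inside the list.
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[index]


def oob_interval(prediction, oob_residuals, alpha=0.1):
    """Build a (1 - alpha) interval around a point prediction.

    Assumption: oob_residuals holds y_i minus the out-of-bag prediction
    for each training point, computed elsewhere by the forest.
    """
    lower = prediction + empirical_quantile(oob_residuals, alpha / 2)
    upper = prediction + empirical_quantile(oob_residuals, 1 - alpha / 2)
    return lower, upper


# Hypothetical out-of-bag residuals, for illustration only.
residuals = [-2.0, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0]
lower, upper = oob_interval(10.0, residuals, alpha=0.2)
```

Because the interval uses residual quantiles rather than a normal approximation, it does not need symmetric or Gaussian errors, which matches the robustness to non-normal residuals that the abstract reports for its own intervals.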


Machine Learning with ML.NET - Random Forest

#artificialintelligence

One of the most popular ways to build ensembles is to use the same algorithm multiple times, each time on a different subset of the training dataset. The techniques used for this are called bagging and pasting. The only difference between them is that, while building subsets, bagging allows training instances to be sampled several times for the same predictor, whereas pasting does not. Once all predictors are trained, the ensemble makes a prediction by aggregating their individual predictions. For classification this is usually done by hard voting, while for regression the average of the results is taken.
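
The difference between bagging and pasting, and the two aggregation rules, can be sketched in a few lines of Python. The article itself targets ML.NET, so this is only a standard-library illustration and not its code. The helper names `draw_subset`, `hard_vote`, and `average` are hypothetical.

```python
import random
from collections import Counter
from statistics import mean


def draw_subset(data, size, replace, rng):
    """Draw a training subset for one predictor.

    Bagging samples with replacement, so an instance may appear twice.
    Pasting samples without replacement, so every instance is unique.
    """
    if replace:
        return rng.choices(data, k=size)  # bagging
    return rng.sample(data, size)  # pasting


def hard_vote(labels):
    """Classification: return the class predicted most often."""
    return Counter(labels).most_common(1)[0][0]


def average(values):
    """Regression: return the mean of the individual predictions."""
    return mean(values)


rng = random.Random(0)
data = list(range(10))
bag = draw_subset(data, 10, replace=True, rng=rng)
paste = draw_subset(data, 6, replace=False, rng=rng)

vote = hard_vote(["spam", "ham", "spam"])
avg = average([1.0, 2.0, 3.0])
```

With pasting the subset size cannot exceed the size of the dataset, whereas a bagged subset can be as large as the dataset and still leave some instances out.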


The Top 10 Machine Learning Algorithms for ML Beginners

#artificialintelligence

Interest in learning machine learning has skyrocketed in the years since a Harvard Business Review article named 'Data Scientist' the 'Sexiest job of the 21st century'. But if you're just starting out in machine learning, it can be a bit difficult to break into. (This post has been reposted with permission, and was last updated in 2019.) This post is targeted towards beginners. If you've got some experience in data science and machine learning, you may be more interested in this more in-depth tutorial on doing machine learning in Python with scikit-learn, or in our machine learning courses, which start here. If you're not clear yet on the differences between "data science" and "machine learning," this article offers a good explanation: machine learning and data science -- what makes them different? Machine learning algorithms are programs that can learn from data and improve from experience, without human intervention.


Efficient Encrypted Inference on Ensembles of Decision Trees

arXiv.org Artificial Intelligence

Data privacy concerns often prevent the use of cloud-based machine learning services for sensitive personal data. While homomorphic encryption (HE) offers a potential solution by enabling computations on encrypted data, the challenge is to obtain accurate machine learning models that work within the multiplicative depth constraints of a leveled HE scheme. Existing approaches for encrypted inference either make ad-hoc simplifications to a pre-trained model (e.g., replace hard comparisons in a decision tree with soft comparators) at the cost of accuracy or directly train a new depth-constrained model using the original training set. In this work, we propose a framework to transfer knowledge extracted by complex decision tree ensembles to shallow neural networks (referred to as DTNets) that are highly conducive to encrypted inference. Our approach minimizes the accuracy loss by searching for the best DTNet architecture that operates within the given depth constraints and training this DTNet using only synthetic data sampled from the training data distribution. Extensive experiments on real-world datasets demonstrate that these characteristics are critical in ensuring that DTNet accuracy approaches that of the original tree ensemble. Our system is highly scalable and can perform efficient inference on batched encrypted (134 bits of security) data with amortized time in milliseconds. This is approximately three orders of magnitude faster than the standard approach of applying soft comparison at the internal nodes of the ensemble trees.


Slow-Growing Trees

arXiv.org Machine Learning

Random Forest's performance can be matched by a single slow-growing tree (SGT), which uses a learning rate to tame CART's greedy algorithm. SGT exploits the view that CART is an extreme case of an iterative weighted least square procedure. Moreover, a unifying view of Boosted Trees (BT) and Random Forests (RF) is presented. Greedy ML algorithms' outcomes can be improved using either "slow learning" or diversification. SGT applies the former to estimate a single deep tree, and Booging (bagging stochastic BT with a high learning rate) uses the latter with additive shallow trees. The performance of this tree ensemble quaternity (Booging, BT, SGT, RF) is assessed on simulated and real regression tasks.


MDA for random forests: inconsistency, and a practical solution via the Sobol-MDA

arXiv.org Machine Learning

Variable importance measures are the main tools to analyze the black-box mechanism of random forests. Although the Mean Decrease Accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its theoretical properties. In fact, the exact MDA definition varies across the main random forest software. In this article, our objective is to rigorously analyze the behavior of the main MDA implementations. Consequently, we mathematically formalize the various implemented MDA algorithms, and then establish their limits when the sample size increases. In particular, we break down these limits into three components: the first two are related to Sobol indices, which are well-defined measures of a variable's contribution to the output variance, widely used in the sensitivity analysis field, as opposed to the third term, whose value increases with dependence within input variables. Thus, we theoretically demonstrate that the MDA does not target the right quantity when inputs are dependent, a fact that has already been noticed experimentally. To address this issue, we define a new importance measure for random forests, the Sobol-MDA, which fixes the flaws of the original MDA. We prove the consistency of the Sobol-MDA and show its good empirical performance through experiments on both simulated and real data. An open source implementation in R and C++ is available online.
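
For readers unfamiliar with the MDA, the following Python sketch shows the generic permutation idea behind it: permute one input column and measure how much the prediction error grows. It is a simplified illustration on a fixed, hypothetical predictor with made-up data. It is not any of the specific software implementations the article analyzes (which typically work on out-of-bag samples), and it is not the Sobol-MDA.

```python
import random


def mse(model, X, y):
    """Mean squared error of a model over rows X with targets y."""
    return sum((model(x) - t) ** 2 for x, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, column, rng):
    """Return the increase in error after shuffling one input column."""
    base = mse(model, X, y)
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    # Rebuild each row with the shuffled value in the chosen column.
    X_perm = [row[:column] + [v] + row[column + 1:]
              for row, v in zip(X, shuffled)]
    return mse(model, X_perm, y) - base


def model(x):
    """Hypothetical fitted predictor that only uses the first input."""
    return 2.0 * x[0]


rng = random.Random(0)
X = [[float(i), float(i % 3)] for i in range(20)]
y = [2.0 * row[0] for row in X]

importance = [permutation_importance(model, X, y, j, rng) for j in range(2)]
```

Here the second input gets an importance of exactly zero because the predictor ignores it. Shuffling a column breaks its relationship with the other inputs as well as with the target, which gives some intuition for why dependence between inputs complicates the interpretation of this kind of measure.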


Machine Learning 101: Decision Tree Algorithm for Classification

#artificialintelligence

The decision tree algorithm belongs to the family of supervised machine learning algorithms. It can be used for both classification and regression problems. The goal of this algorithm is to create a model that predicts the value of a target variable. To do so, the decision tree uses a tree representation in which each leaf node corresponds to a class label and attributes are represented on the internal nodes of the tree. It will split our data into two branches, High and Normal, based on cholesterol, as you can see in the above figure. Suppose our new patient has high cholesterol: from this split alone, we cannot say whether Drug A or Drug B will be suitable for the patient.
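
The cholesterol example can be illustrated with a short Python sketch that splits a toy dataset on one attribute and measures how mixed each branch is. The patient records are invented for illustration. The use of Gini impurity as the measure is also an assumption, since the excerpt does not say which splitting criterion the article uses.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure branch, higher for mixed labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split(rows, attribute):
    """Group the target labels by the value of one attribute."""
    branches = {}
    for row in rows:
        branches.setdefault(row[attribute], []).append(row["drug"])
    return branches


# Hypothetical patient records, for illustration only.
patients = [
    {"cholesterol": "High", "drug": "A"},
    {"cholesterol": "High", "drug": "B"},
    {"cholesterol": "Normal", "drug": "B"},
    {"cholesterol": "Normal", "drug": "B"},
]

branches = split(patients, "cholesterol")
impurity = {value: gini(labels) for value, labels in branches.items()}
```

In this toy data the Normal branch is pure, so it can become a leaf labelled Drug B. The High branch still mixes both drugs, so it cannot yet be a leaf and the tree would need a further split on another attribute.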