Goto

Collaborating Authors

 Decision Tree Learning


Random Forest for Dissimilarity-based Multi-view Learning

arXiv.org Machine Learning

Many classification problems are naturally multi-view in the sense their data are described through multiple heterogeneous descriptions. For such tasks, dissimilarity strategies are effective ways to make the different descriptions comparable and to easily merge them, by (i) building intermediate dissimilarity representations for each view and (ii) fusing these representations by averaging the dissimilarities over the views. In this work, we show that the Random Forest proximity measure can be used to build the dissimilarity representations, since this measure reflects similarities between features but also class membership. We then propose a Dynamic View Selection method to better combine the view-specific dissimilarity representations. This allows to take a decision, on each instance to predict, with only the most relevant views for that instance. Experiments are conducted on several real-world multi-view datasets, and show that the Dynamic View Selection offers a significant improvement in performance compared to the simple average combination and two state-of-the-art static view combinations.


Programming by Rewards

arXiv.org Artificial Intelligence

We formalize and study ``programming by rewards'' (PBR), a new approach for specifying and synthesizing subroutines for optimizing some quantitative metric such as performance, resource utilization, or correctness over a benchmark. A PBR specification consists of (1) input features $x$, and (2) a reward function $r$, modeled as a black-box component (which we can only run), that assigns a reward for each execution. The goal of the synthesizer is to synthesize a "decision function" $f$ which transforms the features to a decision value for the black-box component so as to maximize the expected reward $E[r \circ f (x)]$ for executing decisions $f(x)$ for various values of $x$. We consider a space of decision functions in a DSL of loop-free if-then-else programs, which can branch on linear functions of the input features in a tree-structure and compute a linear function of the inputs in the leaves of the tree. We find that this DSL captures decision functions that are manually written in practice by programmers. Our technical contribution is the use of continuous-optimization techniques to perform synthesis of such decision functions as if-then-else programs. We also show that the framework is theoretically-founded ---in cases when the rewards satisfy nice properties, the synthesized code is optimal in a precise sense. We have leveraged PBR to synthesize non-trivial decision functions related to search and ranking heuristics in the PROSE codebase (an industrial strength program synthesis framework) and achieve competitive results to manually written procedures over multiple man years of tuning. We present empirical evaluation against other baseline techniques over real-world case studies (including PROSE) as well on simple synthetic benchmarks.


Misclassification cost-sensitive ensemble learning: A unifying framework

arXiv.org Machine Learning

The task of supervised machine learning is given a set of recorded observations and their outcomes to predict the outcome of new observations. Standard classification techniques aim for the highest overall accuracy or, equivalently, for the smallest total error, and include among others support vector machines, Bayesian classifiers, logistic regression, decision tree classifiers such as CART [6] and C4.5 [38], and ensemble methods which build several classifiers and aggregate their predictions such as Bagging [4], AdaBoost [16] and Random Forests [5]. Of particular interest in certain domains are binary classifiers which deal with cases where only two classes of outcomes are considered, such as fraudulent and legitimate credit card transactions, responders and non-responders to a marketing campaign, patients with and without cancer, intrusive and authorised network access, and defaulting and repaying debtors to name a few. In most of these cases, one of the classes is a small minority and consequently traditional classifiers might classify all of its members as belonging to the majority class without any significant overall accuracy loss. The severity of this class imbalance becomes more noticeable when failing to correctly predict a minority class member is more costly than doing so with a member of the majority class, as the case often is. A remedy to the undesirable situation just described are classifiers which, instead of accuracy, take misclassification costs into account and are thus termed cost-sensitive. We illustrate this idea in the credit card fraud detection framework: accepting a fraudulent transaction as legitimate incurs a cost equal to its amount.


A unified survey on treatment effect heterogeneity modeling and uplift modeling

arXiv.org Machine Learning

A central question in many fields of scientific research is to determine how an outcome would be affected by an action, or to measure the effect of an action (a.k.a treatment effect). In recent years, a need for estimating the heterogeneous treatment effects conditioning on the different characteristics of individuals has emerged from research fields such as personalized healthcare, social science, and online marketing. To meet the need, researchers and practitioners from different communities have developed algorithms by taking the treatment effect heterogeneity modeling approach and the uplift modeling approach, respectively. In this paper, we provide a unified survey of these two seemingly disconnected yet closely related approaches under the potential outcome framework. We then provide a structured survey of existing methods by emphasizing on their inherent connections with a set of unified notations to make comparisons of the different methods easy. We then review the main applications of the surveyed methods in personalized marketing, personalized medicine, and social studies. Finally, we summarize the existing software packages and present discussions based on the use of methods on synthetic, semi-synthetic and real world data sets and provide some general guidelines for choosing methods.


Machine Learning Pipelines with Azure ML Studio

#artificialintelligence

Machine Learning Pipelines with Azure ML Studio What can Azure ML pipelines do? In this project-based course, you are going to build an end-to-end machine learning pipeline in Azure ML Studio, all without writing a single line of code! This course uses the Adult Income Census data set to train a model to predict an individual's income. It predicts whether an individual's annual income is greater than or less than $50,000. The estimator used in this project is a Two-Class Boosted Decision Tree classifier.


Prospective evaluation of an artificial intelligence-enabled algorithm for automated diabetic retinopathy screening of 30 000 patients

#artificialintelligence

Background/aims Human grading of digital images from diabetic retinopathy (DR) screening programmes represents a significant challenge, due to the increasing prevalence of diabetes. We evaluate the performance of an automated artificial intelligence (AI) algorithm to triage retinal images from the English Diabetic Eye Screening Programme (DESP) into test-positive/technical failure versus test-negative, using human grading following a standard national protocol as the reference standard. Methods Retinal images from 30ย 405 consecutive screening episodes from three English DESPs were manually graded following a standard national protocol and by an automated process with machine learning enabled software, EyeArt v2.1. Screening performance (sensitivity, specificity) and diagnostic accuracy (95% CIs) were determined using human grades as the reference standard. Results Sensitivity (95% CIs) of EyeArt was 95.7% (94.8% to 96.5%) for referable retinopathy (human graded ungradable, referable maculopathy, moderate-to-severe non-proliferative or proliferative). This comprises sensitivities of 98.3% (97.3% to 98.9%) for mild-to-moderate non-proliferative retinopathy with referable maculopathy, 100% (98.7%,100%) for moderate-to-severe non-proliferative retinopathy and 100% (97.9%,100%) for proliferative disease. EyeArt agreed with the human grade of no retinopathy (specificity) in 68% (67% to 69%), with a specificity of 54.0% (53.4% to 54.5%) when combined with non-referable retinopathy. Conclusion The algorithm demonstrated safe levels of sensitivity for high-risk retinopathy in a real-world screening service, with specificity that could halve the workload for human graders. AI machine learning and deep learning algorithms such as this can provide clinically equivalent, rapid detection of retinopathy, particularly in settings where a trained workforce is unavailable or where large-scale and rapid results are needed.


Editable AI: Mixed Human-AI Authoring of Code Patterns

arXiv.org Artificial Intelligence

Developers authoring HTML documents define elements following patterns which establish and reflect the visual structure of a document, such as making all images in a footer the same height by applying a class to each. To surface these patterns to developers and support developers in authoring consistent with these patterns, we propose a mixed human-AI technique for creating code patterns. Patterns are first learned from individual HTML documents through a decision tree, generating a representation which developers may view and edit. Code patterns are used to offer developers autocomplete suggestions, list examples, and flag violations. To evaluate our technique, we conducted a user study in which 24 participants wrote, edited, and corrected HTML documents. We found that our technique enabled developers to edit and correct documents more quickly and create, edit, and correct documents more successfully.


Joints in Random Forests

arXiv.org Artificial Intelligence

Decision Trees (DTs) and Random Forests (RFs) are powerful discriminative learners and tools of central importance to the everyday machine learning practitioner and data scientist. Due to their discriminative nature, however, they lack principled methods to process inputs with missing features or to detect outliers, which requires pairing them with imputation techniques or a separate generative model. In this paper, we demonstrate that DTs and RFs can naturally be interpreted as generative models, by drawing a connection to Probabilistic Circuits, a prominent class of tractable probabilistic models. This reinterpretation equips them with a full joint distribution over the feature space and leads to Generative Decision Trees (GeDTs) and Generative Forests (GeFs), a family of novel hybrid generative-discriminative models. This family of models retains the overall characteristics of DTs and RFs while additionally being able to handle missing features by means of marginalisation. Under certain assumptions, frequently made for Bayes consistency results, we show that consistency in GeDTs and GeFs extend to any pattern of missing input features, if missing at random. Empirically, we show that our models often outperform common routines to treat missing data, such as K-nearest neighbour imputation, and moreover, that our models can naturally detect outliers by monitoring the marginal probability of input features.


Towards Robust Classification with Deep Generative Forests

arXiv.org Machine Learning

Decision Trees and Random Forests are among the most widely used machine learning models, and often achieve state-of-the-art performance in tabular, domain-agnostic datasets. Nonetheless, being primarily discriminative models they lack principled methods to manipulate the uncertainty of predictions. In this paper, we exploit Generative Forests (GeFs), a recent class of deep probabilistic models that addresses these issues by extending Random Forests to generative models representing the full joint distribution over the feature space. We demonstrate that GeFs are uncertainty-aware classifiers, capable of measuring the robustness of each prediction as well as detecting out-of-distribution samples.


Random Ensemble Machine Learning in Python: Random Udemy

#artificialintelligence

Ensemble Machine Learning in Python: Random Forest, AdaBoost 4.6 (1,193 ratings) Course Ratings are calculated from individual students' ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately. In recent years, we've seen a resurgence in AI, or artificial intelligence, and machine learning. Machine learning has led to some amazing results, like being able to analyze medical images and predict diseases on-par with human experts. Google's AlphaGo program was able to beat a world champion in the strategy game go using deep reinforcement learning. Machine learning is even being used to program self driving cars, which is going to change the automotive industry forever.