Decision Tree Learning
Synthetic Sampling for Multi-Class Malignancy Prediction
Yung, Matthew, Brown, Eli T., Rasin, Alexander, Furst, Jacob D., Raicu, Daniela S.
We explore several oversampling techniques for an imbalanced multi-label classification problem, a setting often encountered when developing models for Computer-Aided Diagnosis (CADx) systems. While most CADx systems aim to optimize classifiers for overall accuracy without considering the relative distribution of each class, we look into using synthetic sampling to increase per-class performance when predicting the degree of malignancy. Using low-level image features and a random forest classifier, we show that using synthetic oversampling techniques increases the sensitivity of the minority classes by an average of 7.22% points, with as much as a 19.88% point increase in sensitivity for a particular minority class. Furthermore, the analysis of low-level image feature distributions for the synthetic nodules reveals that these nodules can provide insights on how to preprocess image data for better classification performance or how to supplement the original datasets when more data acquisition is feasible.
Generating Synthetic but Plausible Healthcare Record Datasets
Aviรฑรณ, Laura, Ruffini, Matteo, Gavaldร , Ricard
Generating datasets that "look like" given real ones is an interesting tasks for healthcare applications of ML and many other fields of science and engineering. In this paper we propose a new method of general application to binary datasets based on a method for learning the parameters of a latent variable moment that we have previously used for clustering patient datasets. We compare our method with a recent proposal (MedGan) based on generative adversarial methods and find that the synthetic datasets we generate are globally more realistic in at least two senses: real and synthetic instances are harder to tell apart by Random Forests, and the MMD statistic. The most likely explanation is that our method does not suffer from the "mode collapse" which is an admitted problem of GANs. Additionally, the generative models we generate are easy to interpret, unlike the rather obscure GANs. Our experiments are performed on two patient datasets containing ICD-9 diagnostic codes: the publicly available MIMIC-III dataset and a dataset containing admissions for congestive heart failure during 7 years at Hospital de Sant Pau in Barcelona.
Machine learning predicts World Cup winner
The random-forest technique has emerged in recent years as a powerful way to analyze large data sets while avoiding some of the pitfalls of other data-mining methods. It is based on the idea that some future event can be determined by a decision tree in which an outcome is calculated at each branch by reference to a set of training data. However, decision trees suffer from a well-known problem. In the latter stages of the branching process, decisions can become severely distorted by training data that is sparse and prone to huge variation at this kind of resolution, a problem known as overfitting. The random-forest approach is different.
A real-time decision support system for bridge management based on the rules generalized by CART decision tree and SMO algorithms
Abpeykar, Shadi, Ghatee, Mehdi
Under dynamic conditions on bridges, we need a real-time management. To this end, this paper presents a rule-based decision support system in which the necessary rules are extracted from simulation results made by Aimsun traffic micro-simulation software. Then, these rules are generalized by the aid of fuzzy rule generation algorithms. Then, they are trained by a set of supervised and the unsupervised learning algorithms to get an ability to make decision in real cases. As a pilot case study, Nasr Bridge in Tehran is simulated in Aimsun and WEKA data mining software is used to execute the learning algorithms. Based on this experiment, the accuracy of the supervised algorithms to generalize the rules is greater than 80%. In addition, CART decision tree and sequential minimal optimization (SMO) provides 100% accuracy for normal data and these algorithms are so reliable for crisis management on bridge. This means that, it is possible to use such machine learning methods to manage bridges in the real-time conditions.
A Tour of The Top 10 Algorithms for Machine Learning Newbies
In machine learning, there's something called the "No Free Lunch" theorem. In a nutshell, it states that no one algorithm works best for every problem, and it's especially relevant for supervised learning (i.e. For example, you can't say that neural networks are always better than decision trees or vice-versa. There are many factors at play, such as the size and structure of your dataset. As a result, you should try many different algorithms for your problem, while using a hold-out "test set" of data to evaluate performance and select the winner.
Machine learning predicts World Cup winner
The random-forest technique has emerged in recent years as a powerful way to analyze large data sets while avoiding some of the pitfalls of other data-mining methods. It is based on the idea that some future event can be determined by a decision tree in which an outcome is calculated at each branch by reference to a set of training data. However, decision trees suffer from a well-known problem. In the latter stages of the branching process, decisions can become severely distorted by training data that is sparse and prone to huge variation at this kind of resolution, a problem known as overfitting. The random-forest approach is different.
Boulevard: Regularized Stochastic Gradient Boosted Trees and Their Limiting Distribution
This paper presents a theoretical study of gradient boosted trees (GBT: Friedman, 2001). Machine learning methods for prediction have generally been thought of as trading off both intelligibility and statistical uncertainty quantification in favor of accuracy. Recent results have started to provide a statistical understanding of methods based on ensembles of decision trees (Breiman et al., 1984). In particular, the consistency of methods related to Random Forests (RFs: Breiman, 2001) has been demonstrated in Biau (2012); Scornet et al. (2015) while Wager et al. (2014); Mentch and Hooker (2016); Wager and Athey (2017) and Athey et al. (2016) prove central limit theorems for RF predictions. These have then been used for tests of variable importance and nonparametric interactions in Mentch and Hooker (2017). In this paper, we extend this analysis to GBT. Analyses of RFs have relied on a subsampling structure to express the estimator in the form of a U-statistic from which central limit theorems can be derived. By contrast, GBT produces trees sequentially with the current tree depending on the values in those built previously, requiring a different analytical approach. While the algorithm proposed in Friedman (2001) is intended to be generally applicable to any loss function, in this paper we focus specifically on nonparametric regression (Stone, 1977, 1982).
Decision Trees in R
Let's imagine you are playing a game of Twenty Questions. Your opponent has secretly chosen a subject, and you must figure out what he/she chose. At each turn, you may ask a yes-or-no question, and your opponent must answer truthfully. How do you find out the secret in the fewest number of questions? It should be obvious some questions are better than others. For example, asking "Can it fly?" as your first question is likely to be unfruitful, whereas asking "Is it alive?" is a bit more useful. Intuitively, you want each question to significantly narrow down the space of possibly secrets, eventually leading to your answer. That is the basic idea behind decision trees. At each point, you consider a set of questions that can partition your data set. You choose the question that provides the best split and again find the best questions for the partitions.
A classification point-of-view about conditional Kendall's tau
Derumigny, Alexis, Fermanian, Jean-David
We show how the problem of estimating conditional Kendall's tau can be rewritten as a classification task. Conditional Kendall's tau is a conditional dependence parameter that is a characteristic of a given pair of random variables. The goal is to predict whether the pair is concordant (value of $1$) or discordant (value of $-1$) conditionally on some covariates. We prove the consistency and the asymptotic normality of a family of penalized approximate maximum likelihood estimators, including the equivalent of the logit and probit regressions in our framework. Then, we detail specific algorithms adapting usual machine learning techniques, including nearest neighbors, decision trees, random forests and neural networks, to the setting of the estimation of conditional Kendall's tau. A small simulation study compares their finite sample properties. Finally, we apply all these estimators to a dataset of European stock indices.
Machine learning predicts World Cup winner
The random-forest technique has emerged in recent years as a powerful way to analyze large data sets while avoiding some of the pitfalls of other data-mining methods. It is based on the idea that some future event can be determined by a decision tree in which an outcome is calculated at each branch by reference to a set of training data. However, decision trees suffer from a well-known problem. In the latter stages of the branching process, decisions can become severely distorted by training data that is sparse and prone to huge variation at this kind of resolution, a problem known as overfitting. The random-forest approach is different.