 Ensemble Learning


Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables

#artificialintelligence

Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still present in the cross-validation residuals, indicates that the predictions may be biased, which is suboptimal. This paper presents a random forest for spatial predictions framework (RFsp) where buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process. The RFsp framework is illustrated with examples that use textbook datasets and apply spatial and spatio-temporal prediction to numeric, binary, categorical, multivariate and spatiotemporal variables. Performance of the RFsp framework is compared with the state-of-the-art kriging techniques using fivefold cross-validation with refitting.
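The idea of using buffer distances as explanatory variables can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `buffer_distance_features`, the toy coordinates, and the choice of plain Euclidean distance are all assumptions. Each target location gets one feature per observation point, and the resulting rows could then be passed to any random forest implementation.

```python
import math

def buffer_distance_features(targets, observations):
    """Return one row per target location.

    Each row holds the Euclidean distance from that target to every
    observation point, so the row length equals len(observations).
    (Hypothetical helper; Euclidean distance is an assumption.)
    """
    return [
        [math.dist(target, obs) for obs in observations]
        for target in targets
    ]

# Toy coordinates, invented for illustration only.
observations = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
targets = [(0.0, 0.0), (3.0, 0.0)]

features = buffer_distance_features(targets, observations)
for target, row in zip(targets, features):
    print(target, [round(d, 3) for d in row])
```

Because the number of distance features grows with the number of observation points, a sketch like this is only practical for small to moderate datasets.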


Using Machine Learning in Venture Capital

#artificialintelligence

I have already (partially) reviewed previous studies in which data were shown to help identify signals relevant to assessing the success potential of a startup. Even though the list is quite comprehensive, each study tends to look at one single factor and a couple of different success scenarios (namely, acquisition and IPO). In our work, we tried to take a more holistic view and use over 120,000 companies to spot signals not only for acquisitions and IPOs but also to compute the probability of raising a subsequent round of funding or shutting the startup down. In the same fashion as backtesting, we created a time-aware approach and analyzed companies that were no more than four years old by 2015 and tried to predict their success in the following three years. We also used more than a hundred variables as possible explanatory indicators of success, as well as five different models: Support Vector Machines (SVM); Decision Trees (DT); Random Forests (RF); Extremely Randomized Trees (ERT); and Gradient Tree Boosting (GTB).


Identifying Cancer Patients at Risk for Heart Failure Using Machine Learning Methods

arXiv.org Machine Learning

Cardiotoxicity related to cancer therapies has become a serious issue, diminishing cancer treatment outcomes and quality of life. Early detection of cancer patients at risk for cardiotoxicity before cardiotoxic treatments and providing preventive measures are potential solutions to improve cancer patients' quality of life. This study focuses on predicting the development of heart failure in cancer patients after cancer diagnoses using historical electronic health record (EHR) data. We examined four machine learning algorithms using 143,199 cancer patients from the University of Florida Health (UF Health) Integrated Data Repository (IDR). We identified a total of 1,958 qualified cases and matched them to 15,488 controls by gender, age, race, and major cancer type. Two feature encoding strategies were compared to encode variables as machine learning features. The gradient boosting (GB) based model achieved the best AUC score of 0.9077 (with a sensitivity of 0.8520 and a specificity of 0.8138), outperforming other machine learning methods. We also looked into the subgroup of cancer patients with exposure to chemotherapy drugs and observed a lower specificity score (0.7089). The experimental results show that machine learning methods are able to capture clinical factors that are known to be associated with heart failure and that it is feasible to use machine learning methods to identify cancer patients at risk for cancer therapy-related heart failure.
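For readers unfamiliar with the two metrics reported above, sensitivity is the fraction of true cases that are flagged and specificity is the fraction of true controls that are cleared. The sketch below computes both from binary labels. The function name and the toy labels are invented for illustration and have no connection to the study's data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) for binary labels, 1 = case."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls flagged
    return tp / (tp + fn), tn / (tn + fp)

# Toy labels: 4 cases followed by 6 controls (not real patient data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.4f} specificity={spec:.4f}")
```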


Machine Truth Serum

arXiv.org Artificial Intelligence

Wisdom of the crowd revealed a striking fact that the majority answer from a crowd is often more accurate than any individual expert. We observed the same story in machine learning: ensemble methods leverage this idea to combine multiple learning algorithms to obtain better classification performance. Among many popular examples is the celebrated Random Forest, which applies the majority voting rule in aggregating different decision trees to make the final prediction. Nonetheless, these aggregation rules would fail when the majority is more likely to be wrong. In this paper, we extend the idea proposed in Bayesian Truth Serum that "a surprisingly more popular answer is more likely the true answer" to classification problems. The challenge for us is to define or detect when an answer should be considered "surprising". We present two machine learning aided methods which aim to reveal the truth when it is the minority rather than the majority that holds the true answer. Our experiments over real-world datasets show that better classification performance can be obtained compared to always trusting the majority vote. Our proposed methods also outperform popular ensemble algorithms. Our approach can be generically applied as a subroutine in ensemble methods to replace the majority voting rule.
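The contrast between majority voting and a "surprisingly popular" rule can be illustrated with a small sketch. This is not the paper's method: here the expected vote shares (`predicted_share`) are simply given as hypothetical inputs, whereas the paper describes machine learning aided ways of detecting when an answer is surprising. The function names and toy votes are likewise assumptions.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label cast most often."""
    return Counter(votes).most_common(1)[0][0]

def surprisingly_popular(votes, predicted_share):
    """Return the label whose actual vote share most exceeds its
    predicted share. `predicted_share` maps label -> expected fraction
    (hypothetical input; how to estimate it is not covered here)."""
    counts = Counter(votes)
    total = len(votes)
    return max(
        predicted_share,
        key=lambda label: counts[label] / total - predicted_share[label],
    )

# Toy example: 6 of 10 classifiers vote "A", but "A" was expected to get 80%.
votes = ["A"] * 6 + ["B"] * 4
predicted_share = {"A": 0.8, "B": 0.2}

majority = majority_vote(votes)
surprising = surprisingly_popular(votes, predicted_share)
print(majority, surprising)
```

In this toy case the majority rule returns "A", while the surprisingly popular rule returns "B" because "B" received twice the share it was expected to get.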


Predicting the risk of emergency admission with machine learning: Development and validation using linked electronic health records

#artificialintelligence

We used longitudinal data from linked electronic health records of 4.6 million patients aged 18–100 years from 389 practices across England between 1985 and 2015. The population was divided into a derivation cohort (80%, 3.75 million patients from 300 general practices) and a validation cohort (20%, 0.88 million patients from 89 general practices) from geographically distinct regions with different risk levels. We first replicated a previously reported Cox proportional hazards (CPH) model for prediction of the risk of the first emergency admission up to 24 months after baseline. This reference model was then compared with 2 machine learning models, random forest (RF) and gradient boosting classifier (GBC). The initial set of predictors for all models included 43 variables, including patient demographics, lifestyle factors, laboratory tests, currently prescribed medications, selected morbidities, and previous emergency admissions.


Learning to Tune XGBoost with XGBoost

arXiv.org Machine Learning

In this short paper we investigate whether meta-learning techniques can be used to more effectively tune the hyperparameters of machine learning models using successive halving (SH). We propose a novel variant of the SH algorithm (MeSH), that uses meta-regressors to determine which candidate configurations should be eliminated at each round. We apply MeSH to the problem of tuning the hyperparameters of a gradient-boosted decision tree model. By training and tuning our meta-regressors using existing tuning jobs from 95 datasets, we demonstrate that MeSH can often find a superior solution to both SH and random search.
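For context, plain successive halving (the baseline that MeSH modifies) can be sketched as follows. This is a minimal version under stated assumptions: the `evaluate` callback, the `eta` reduction factor, and the toy loss function are hypothetical, and the meta-regressor elimination step that defines MeSH is not reproduced here.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Plain successive halving.

    Evaluate all surviving configurations at the current budget, keep
    the best 1/eta of them (lower loss is better), multiply the budget
    by eta, and repeat until one configuration remains.
    """
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        keep = max(1, len(survivors) // eta)
        survivors = ranked[:keep]
        budget *= eta
    return survivors[0]

def toy_loss(learning_rate, budget):
    """Invented loss: best at learning_rate = 0.1, improves with budget."""
    return abs(learning_rate - 0.1) + 1.0 / budget

candidates = [0.01, 0.1, 0.3, 1.0]
best = successive_halving(candidates, toy_loss)
print(best)
```

In a real tuning job, `evaluate` would train a model for `budget` units (for example, boosting rounds) and return a validation loss.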


InterpretML: A Unified Framework for Machine Learning Interpretability

arXiv.org Machine Learning

InterpretML is an open-source Python package which exposes machine learning interpretability algorithms to practitioners and researchers. InterpretML exposes two types of interpretability - glassbox models, which are machine learning models designed for interpretability (ex: linear models, rule lists, generalized additive models), and blackbox explainability techniques for explaining existing systems (ex: Partial Dependence, LIME). The package enables practitioners to easily compare interpretability algorithms by exposing multiple methods under a unified API, and by having a built-in, extensible visualization platform. InterpretML also includes the first implementation of the Explainable Boosting Machine, a powerful, interpretable, glassbox model that can be as accurate as many blackbox models. The MIT licensed source code can be downloaded from github.com/microsoft/interpret.


Voting with Random Classifiers (VORACE)

arXiv.org Artificial Intelligence

In many machine learning scenarios, looking for the best classifier that fits a particular dataset can be very costly in terms of time and resources. Moreover, it can require deep knowledge of the specific domain. We propose a new technique which does not require profound expertise in the domain and avoids the commonly used strategy of hyper-parameter tuning and model selection. Our method is an innovative ensemble technique that uses voting rules over a set of randomly-generated classifiers. Given a new input sample, we interpret the output of each classifier as a ranking over the set of possible classes. We then aggregate these output rankings using a voting rule, which treats them as preferences over the classes. We show that our approach obtains good results compared to the state-of-the-art, both providing a theoretical analysis and an empirical evaluation of the approach on several datasets.
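Aggregating per-classifier rankings with a voting rule can be sketched as below. The abstract does not say which voting rule is used, so the Borda count here is an assumption chosen for illustration, and the function name, class labels, and rankings are invented.

```python
from collections import defaultdict

def borda_winner(rankings):
    """Aggregate rankings (best class first) with a Borda count.

    A class in position i of a ranking over m classes earns m - 1 - i
    points. Ties are broken alphabetically (an arbitrary choice).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        m = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += m - 1 - position
    return max(sorted(scores), key=scores.get)

# Each inner list is one classifier's ranking of the classes for a sample.
rankings = [
    ["cat", "dog", "bird"],
    ["dog", "cat", "bird"],
    ["dog", "bird", "cat"],
    ["cat", "dog", "bird"],
    ["bird", "dog", "cat"],
]

winner = borda_winner(rankings)
print(winner)
```

Here "cat" and "dog" tie on first-place votes, but "dog" wins under Borda because it is never ranked last, which shows how using full rankings can change the outcome compared with counting only top choices.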


8 Parameters to Qualify AI Solutions (SalesChoice)

#artificialintelligence

One way could be to identify some of the most critical parameters to look for in any AI solution, and to rate/label them on a standard scale. A few such parameters are discussed below. Perhaps the community and policymakers can crystallize these further, and add to the list. Decision trees, Random forest, Gradient boosting, Monte Carlo, to name a few. The use of any one of these (say, Regression) in a solution can technically qualify it as AI-enabled, but it would not be very accurate or useful for a user. This has led to disillusionment among early AI users, while also giving rise to a plethora of solutions and companies calling themselves AI.


Many Heads Are Better Than One: The Case For Ensemble Learning

#artificialintelligence

"The interests of truth require a diversity of opinions." Banks and lenders are increasingly turning to AI and machine learning to automate their core functions and make more accurate predictions in credit underwriting and fraud detection. ML practitioners can take advantage of a growing number of modeling algorithms, such as simple decision trees, random forests, gradient boosting machines, deep neural networks, and support vector machines. Each method has its strengths and weaknesses, which is why it often makes sense to combine ML algorithms to provide even greater predictive performance than any single ML method could provide on its own. This method of combining algorithms is known as ensembling.