Ensemble Learning
Machine Learning Approaches to Predict 6-Month Mortality Among Patients With Cancer
Question Can machine learning algorithms identify oncology patients at risk of short-term mortality to inform timely conversations between patients and physicians regrading serious illness? Findings In this cohort study of 26 525 patients seen in oncology practices within a large academic health system, machine learning algorithms accurately identified patients at high risk of 6-month mortality with good discrimination and positive predictive value. When the gradient boosting algorithm was applied in real time, most patients who were classified as having high risk were deemed appropriate by oncology clinicians for a conversation regarding serious illness. Meaning In this study, machine learning algorithms accurately identified patients with cancer who were at risk of 6-month mortality, suggesting that these models could facilitate more timely conversations between patients and physicians regarding goals and values. Importance Machine learning algorithms could identify patients with cancer who are at risk of short-term mortality. However, it is unclear how different machine learning algorithms compare and whether they could prompt clinicians to have timely conversations about treatment and end-of-life preferences. Objectives To develop, validate, and compare machine learning algorithms that use structured electronic health record data before a clinic visit to predict mortality among patients with cancer. Design, Setting, and Participants Cohort study of 26 525 adult patients who had outpatient oncology or hematology/oncology encounters at a large academic cancer center and 10 affiliated community practices between February 1, 2016, and July 1, 2016.
Beating the S&P500 Using Machine Learning
A machine learning algorithm written in Python was designed to predict which companies from the S&P 1500 index are likely to beat the S&P 500 index on a monthly basis. To do so, a random forest regression based algorithm, taking as input the financial ratios of all the constituents of the S&P 1500, was implemented. We will therefore skip step 1 in this article. Those with access to the datasets through the required subscriptions can instead refer to the complete notebook hosted on the following Github project: SP1500StockPicker. The random forest method is based on multiple decision trees.
Probabilistic Hydrological Post-Processing at Scale: Why and How to Apply Machine-Learning Quantile Regression Algorithms
We conduct a large-scale benchmark experiment aiming to advance the use of machine-learning quantile regression algorithms for probabilistic hydrological post-processing "at scale" within operational contexts. The experiment is set up using 34-year-long daily time series of precipitation, temperature, evapotranspiration and streamflow for 511 catchments over the contiguous United States. Point hydrological predictions are obtained using the Génie Rural à 4 paramètres Journalier (GR4J) hydrological model and exploited as predictor variables within quantile regression settings. Six machine-learning quantile regression algorithms and their equal-weight combiner are applied to predict conditional quantiles of the hydrological model errors. The individual algorithms are quantile regression, generalized random forests for quantile regression, generalized random forests for quantile regression emulating quantile regression forests, gradient boosting machine, model-based boosting with linear models as base learners and quantile regression neural networks.
Gradient Boosted Decision Tree Neural Network
Saberian, Mohammad, Delgado, Pablo, Raimond, Yves
In this paper we propose a method to build a neural network that is similar to an ensemble of decision trees. We first illustrate how to convert a learned ensemble of decision trees to a single neural network with one hidden layer and an input transformation. We then relax some properties of this network such as thresholds and activation functions to train an approximately equivalent decision tree ensemble. The final model, Hammock, is surprisingly simple: a fully connected two layers neural network where the input is quantized and one-hot encoded. Experiments on large and small datasets show this simple method can achieve performance similar to that of Gradient Boosted Decision Trees.
Migration through Machine Learning Lens -- Predicting Sexual and Reproductive Health Vulnerability of Young Migrants
Nigam, Amber, Jaiswal, Pragati, Girkar, Uma, Arora, Teertha, Celi, Leo A.
In this paper, we have discussed initial findings and results of our experiment to predict sexual and reproductive health vulnerabilities of migrants in a data-constrained environment. Notwithstanding the limited research and data about migrants and migration cities, we propose a solution that simultaneously focuses on data gathering from migrants, augmenting awareness of the migrants to reduce mishaps, and setting up a mechanism to present insights to the key stakeholders in migration to act upon. We have designed a webapp for the stakeholders involved in migration: migrants, who would participate in data gathering process and can also use the app for getting to know safety and awareness tips based on analysis of the data received; public health workers, who would have an access to the database of migrants on the app; policy makers, who would have a greater understanding of the ground reality, and of the patterns of migration through machine-learned analysis. Finally, we have experimented with different machine learning models on an artificially curated dataset. We have shown, through experiments, how machine learning can assist in predicting the migrants at risk and can also help in identifying the critical factors that make migration dangerous for migrants. The results for identifying vulnerable migrants through machine learning algorithms are statistically significant at an alpha of 0.05.
How I scored in the top 1% of Kaggle's Titanic Machine Learning Challenge
You don't need to reinvent the wheel, you need to know how to use the wheel to make your car better. The Titanic challenge hosted by Kaggle is a competition in which the goal is to predict the survival or the death of a given passenger based on a set of variables describing him such as his age, his sex, or his passenger class on the boat. I have been playing with the Titanic dataset for a while. As I'm writing this post, I am ranked 113th out of 11002 participants. You must be wondering how did I manage to achieve this.
How I scored in the top 1% of Kaggle's Titanic Machine Learning Challenge
You don't need to reinvent the wheel, you need to know how to use the wheel to make your car better. The Titanic challenge hosted by Kaggle is a competition in which the goal is to predict the survival or the death of a given passenger based on a set of variables describing him such as his age, his sex, or his passenger class on the boat. I have been playing with the Titanic dataset for a while. As I'm writing this post, I am ranked 113th out of 11002 participants. You must be wondering how did I manage to achieve this.
Machine Learning for Generalizable Prediction of Flood Susceptibility
Sidrane, Chelsea, Fitzpatrick, Dylan J, Annex, Andrew, O'Donoghue, Diane, Gal, Yarin, Biliński, Piotr
Flooding is a destructive and dangerous hazard and climate change appears to be increasing the frequency of catastrophic flooding events around the world. Physics-based flood models are costly to calibrate and are rarely generalizable across different river basins, as model outputs are sensitive to site-specific parameters and human-regulated infrastructure. In contrast, statistical models implicitly account for such factors through the data on which they are trained. Such models trained primarily from remotely-sensed Earth observation data could reduce the need for extensive in-situ measurements. In this work, we develop generalizable, multi-basin models of river flooding susceptibility using geographically-distributed data from the USGS stream gauge network. Machine learning models are trained in a supervised framework to predict two measures of flood susceptibility from a mix of river basin attributes, impervious surface cover information derived from satellite imagery, and historical records of rainfall and stream height. We report prediction performance of multiple models using precision-recall curves, and compare with performance of naive baselines. This work on multi-basin flood prediction represents a step in the direction of making flood prediction accessible to all at-risk communities.
Breadth-first, Depth-next Training of Random Forests
Anghel, Andreea, Ioannou, Nikolas, Parnell, Thomas, Papandreou, Nikolaos, Mendler-Dünner, Celestine, Pozidis, Haris
In this paper we analyze, evaluate, and improve the performance of training Random Forest (RF) models on modern CPU architectures. An exact, state-of-the-art binary decision tree building algorithm is used as the basis of this study. Firstly, we investigate the trade-offs between using different tree building algorithms, namely breadth-first-search (BFS) and depth-search-first (DFS). We design a novel, dynamic, hybrid BFS-DFS algorithm and demonstrate that it performs better than both BFS and DFS, and is more robust in the presence of workloads with different characteristics. Secondly, we identify CPU performance bottlenecks when generating trees using this approach, and propose optimizations to alleviate them. The proposed hybrid tree building algorithm for RF is implemented in the Snap Machine Learning framework, and speeds up the training of RFs by 7.8x on average when compared to state-of-the-art RF solvers (sklearn, H2O, and xgboost) on a range of datasets, RF configurations, and multi-core CPU architectures.