Machine learning and data science require more than throwing data into a Python library and using whatever comes out. Data scientists need to understand the data and the processes that generate it in order to implement a successful system. One key part of implementation is knowing when a model might benefit from ensemble methods, which combine many base learners that are often trained on bootstrapped or subsampled data. Some examples of ensemble models are AdaBoost and Stochastic Gradient Boosting.
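To make the bootstrapping idea concrete, the following is a minimal sketch of bootstrap aggregation (bagging), a simpler ensemble technique than AdaBoost or Stochastic Gradient Boosting. The decision-stump base learner, the toy data, and all function names are illustrative assumptions, not taken from any particular library.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-feature threshold classifier (decision stump) to (x, label) pairs."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        for left_label, right_label in ((0, 1), (1, 0)):
            errors = sum(
                (left_label if x <= threshold else right_label) != y
                for x, y in points
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def bagging_fit(points, n_models=25, seed=0):
    """Train each stump on a bootstrap sample (same size, drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models


def bagging_predict(models, x):
    """Aggregate the base learners by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: one feature, binary label.
data = [(0.1, 0), (0.4, 0), (0.5, 0), (1.2, 0),
        (2.9, 1), (3.3, 1), (3.8, 1), (4.1, 1)]
ensemble = bagging_fit(data)
```

Each stump sees a slightly different resampled dataset, so individual stumps may disagree near the class boundary, while the vote over all of them is more stable than any single one.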
Medication non-adherence is a widespread problem affecting over 50% of people who have chronic illness and need chronic treatment. Non-adherence exacerbates health risks and drives significant increases in treatment costs. To address these challenges, the importance of predicting patients' adherence has been recognised. In other words, it is important to improve the efficiency of interventions in the current healthcare system by prioritizing resources for the patients who are most likely to be non-adherent. Our objective in this work is to predict whether individual patients will take their medication on time at their next scheduled medication opportunity. We do this by leveraging a number of machine learning models. In particular, we demonstrate the use of a connected IoT device, a "Smart Sharps Bin" invented by HealthBeacon Ltd., to monitor and track injection disposal by patients in their home environment. Using extensive data collected from these devices, five machine learning models, namely Extra Trees Classifier, Random Forest, XGBoost, Gradient Boosting and Multilayer Perceptron, were trained and evaluated on a large dataset comprising 165,223 historic injection disposal records collected from 5,915 HealthBeacon units over the course of 3 years. Testing was conducted on real-time data generated by the smart device over a time period after model training was complete, i.e. true future data. The proposed machine learning approach demonstrated very good predictive performance, exhibiting an Area Under the Receiver Operating Characteristic Curve (ROC AUC) of 0.86.
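For readers unfamiliar with the metric reported above, ROC AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the labels and scores are made-up illustrative values, not data from the study.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical example: label 1 = the next scheduled injection was missed.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
example_auc = roc_auc(example_labels, example_scores)  # 3 of 4 pairs ranked correctly
```

The pairwise loop is quadratic and only meant to expose the definition; a rank-based computation would be preferable on a dataset of the size described.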
The problem of all-relevant feature selection is concerned with finding a relevant feature set with preserved redundancies. There exist several approximations to solve this problem but only one could give a distinction between strong and weak relevance. This approach was limited to the case of linear problems. In this work, we present a new solution for this distinction in the non-linear case through the use of random forest models and statistical methods.
Bootstrap aggregation (Bagging) and boosting are two popular ensemble learning approaches, which combine multiple base learners to generate a composite learner. This article proposes BoostForest, which is an ensemble learning approach using BoostTree as base learners and can be used for both classification and regression. BoostTree constructs a tree by gradient boosting, which trains a linear or nonlinear model at each node. When a new sample comes in, BoostTree first sorts it down to a leaf, then computes the final prediction by summing up the outputs of all models along the path from the root node to that leaf. BoostTree achieves high randomness (diversity) by sampling its parameters randomly from a parameter pool, and selecting a subset of features randomly at node splitting. BoostForest further increases the randomness by bootstrapping the training data in constructing different BoostTrees. BoostForest is compared with four classical ensemble learning approaches on 30 classification and regression datasets, demonstrating that it can generate more accurate and more robust composite learners.
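The prediction rule described above (route the sample to a leaf, then sum the outputs of every node model on the path from the root) can be sketched as follows. The `Node` layout, the field names, and the hand-built example tree are assumptions made for illustration; how each node model is fitted by gradient boosting is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Node:
    model: Callable[[Sequence[float]], float]  # node model (linear or nonlinear)
    feature: Optional[int] = None              # split feature index; None marks a leaf
    threshold: float = 0.0                     # go left if x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def boosttree_predict(root: Node, x: Sequence[float]) -> float:
    """Sum the node-model outputs along the root-to-leaf path taken by x."""
    total = 0.0
    node = root
    while node is not None:
        total += node.model(x)
        if node.feature is None:  # reached a leaf
            break
        node = node.left if x[node.feature] <= node.threshold else node.right
    return total


# Hypothetical two-level tree with hand-picked node models.
tree = Node(
    model=lambda x: 1.0 + 0.5 * x[0],
    feature=0,
    threshold=2.0,
    left=Node(model=lambda x: -0.25),
    right=Node(model=lambda x: 0.5 * x[1]),
)
left_prediction = boosttree_predict(tree, [1.0, 3.0])   # 1.5 + (-0.25)
right_prediction = boosttree_predict(tree, [4.0, 3.0])  # 3.0 + 1.5
```

Accumulating during the descent gives the same result as first locating the leaf and then summing along the path, since the path is the same.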
Artificial intelligence has been applied in wildfire science and management since the 1990s, with early applications including neural networks and expert systems. Since then the field has rapidly progressed congruently with the wide adoption of machine learning (ML) in the environmental sciences. Here, we present a scoping review of ML in wildfire science and management. Our objective is to improve awareness of ML among wildfire scientists and managers, as well as illustrate the challenging range of problems in wildfire science available to data scientists. We first present an overview of popular ML approaches used in wildfire science to date, and then review their use in wildfire science within six problem domains: 1) fuels characterization, fire detection, and mapping; 2) fire weather and climate change; 3) fire occurrence, susceptibility, and risk; 4) fire behavior prediction; 5) fire effects; and 6) fire management. We also discuss the advantages and limitations of various ML approaches and identify opportunities for future advances in wildfire science and management within a data science context. We identified 298 relevant publications, where the most frequently used ML methods included random forests, MaxEnt, artificial neural networks, decision trees, support vector machines, and genetic algorithms. There exist opportunities to apply more current ML methods (e.g., deep learning and agent based learning) in wildfire science. However, despite the ability of ML models to learn on their own, expertise in wildfire science is necessary to ensure realistic modelling of fire processes across multiple scales, while the complexity of some ML methods requires sophisticated knowledge for their application. Finally, we stress that the wildfire research and management community plays an active role in providing relevant, high quality data for use by practitioners of ML methods.
Accurate predictions of reactive mixing are critical for many Earth and environmental science problems. To investigate mixing dynamics over time under different scenarios, a high-fidelity, finite-element-based numerical model is built to solve the fast, irreversible bimolecular reaction-diffusion equations and simulate a range of reactive-mixing scenarios. A total of 2,315 simulations are performed using different sets of model input parameters comprising various spatial scales of vortex structures in the velocity field, time-scales associated with velocity oscillations, the perturbation parameter for the vortex-based velocity, anisotropic dispersion contrast, and molecular diffusion. Outputs comprise concentration profiles of the reactants and products. The inputs and outputs of these simulations are concatenated into feature and label matrices, respectively, to train 20 different machine learning (ML) emulators to approximate system behavior. The 20 ML emulators, based on linear methods, Bayesian methods, ensemble learning methods, and multilayer perceptron (MLP), are compared against one another. The ML emulators are specifically trained to classify the state of mixing and predict three quantities of interest (QoIs) characterizing species production, decay, and degree of mixing. Linear classifiers and regressors fail to reproduce the QoIs; however, ensemble methods (classifiers and regressors) and the MLP accurately classify the state of reactive mixing and predict the QoIs. Among ensemble methods, random forest and decision-tree-based AdaBoost faithfully predict the QoIs. At run time, trained ML emulators are $\approx10^5$ times faster than the high-fidelity numerical simulations. The speed and accuracy of the ensemble and MLP models facilitate uncertainty quantification, which usually requires thousands of model runs, to estimate the uncertainty bounds on the QoIs.
Large-scale collections of electronic records constitute both an opportunity for the development of more accurate prediction models and a threat to privacy. To limit privacy exposure, new privacy-enhancing techniques are emerging, such as federated learning, which enables large-scale data analysis while avoiding the centralization of records in a single database that would represent a critical point of failure. Although promising regarding privacy protection, federated learning prevents the use of some data-cleaning algorithms, thus inducing new biases. In this work we focus on the recurrent problem of duplicated records that, if not handled properly, may cause over-optimistic estimates of a model's performance. We introduce and discuss stratified cross-validation, a validation methodology that leverages stratification techniques to prevent data leakage in federated learning settings without relying on demanding deduplication algorithms.
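One way to picture how stratification can prevent duplicate-driven leakage without deduplication is sketched below. It is an illustration under stated assumptions, not the procedure from this work: each site assigns records to folds by hashing a shared stratification attribute (here a hypothetical `birth_date` field), so copies of the same record held at different sites always land in the same fold and can never sit on both sides of a train/test split.

```python
import hashlib


def fold_for(stratum_key, n_folds=5):
    """Map a stratum key to a fold deterministically; equal keys share a fold."""
    digest = hashlib.sha256(str(stratum_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_folds


def split_site(records, key_field, test_fold, n_folds=5):
    """Split one site's records locally, using only the stratification key."""
    train, test = [], []
    for record in records:
        fold = fold_for(record[key_field], n_folds)
        (test if fold == test_fold else train).append(record)
    return train, test


# Hypothetical data: the same patient appears at both sites (a1 and b1).
site_a = [{"id": "a1", "birth_date": "1956-03-14"},
          {"id": "a2", "birth_date": "1987-11-02"}]
site_b = [{"id": "b1", "birth_date": "1956-03-14"},
          {"id": "b2", "birth_date": "1990-07-21"}]

train_a, test_a = split_site(site_a, "birth_date", test_fold=0)
train_b, test_b = split_site(site_b, "birth_date", test_fold=0)
```

No site needs to see another site's records: the fold depends only on the key, so the split is consistent across the federation by construction. The trade-off is that distinct records sharing a key are also grouped together, which is harmless for leakage but can make folds uneven.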
Current prognostic risk scores in cardiac surgery are based on statistics and do not yet benefit from machine learning. Statistical predictors are not robust enough to correctly identify patients who would benefit from Transcatheter Aortic Valve Implantation (TAVI). This research aims to create a machine learning model to predict the one-year mortality of a patient after TAVI. We adopt a modern gradient boosting on decision trees algorithm, specifically designed for categorical features. In combination with a recent technique for model interpretation, we developed a feature analysis and selection stage that enables us to identify the most important features for the prediction. We base our prediction model on the most relevant features, after interpreting and discussing the feature analysis results with clinical experts. We validated our model on 270 TAVI cases, reaching an AUC of 0.83. Our approach outperforms several widespread prognostic risk scores, such as logistic EuroSCORE II, the STS risk score and the TAVI2-score, which are broadly adopted by cardiologists worldwide.
Tax evasion is the illegal non-payment of taxes by individuals, corporations, and trusts. It results in a loss of state revenue that can undermine the effectiveness of government policies. One measure of tax evasion is the so-called tax gap: the difference between the income that should be reported to the tax authorities and the amount actually reported. However, economists lack a robust method for estimating the tax gap through a bottom-up approach based on fiscal audits. This is difficult because the declared tax base is available for the whole population but the income reported to the tax authorities is generally available only for a small, non-random sample of audited units. This induces a selection bias which invalidates standard statistical methods. Here, we use machine learning based on a two-step Gradient Boosting model to correct for the selection bias without requiring any strong assumption on the distribution. We use our method to estimate the Italian VAT Gap related to individual firms based on information gathered from administrative sources. Our algorithm estimates the potential VAT turnover of Italian individual firms for the fiscal year 2011 and suggests that the tax gap is about 30% of the total potential tax base. Comparisons with other methods show our technique offers a significant improvement in predictive performance.
Fair classification has become an important topic in machine learning research. While most bias mitigation strategies focus on neural networks, we noticed a lack of work on fair classifiers based on decision trees, even though they have proven very efficient. In an up-to-date comparison of state-of-the-art classification algorithms on tabular data, tree boosting outperforms deep learning. For this reason, we have developed a novel approach of adversarial gradient tree boosting. The objective of the algorithm is to predict the output $Y$ with gradient tree boosting while minimizing the ability of an adversarial neural network to predict the sensitive attribute $S$. The approach incorporates at each iteration the gradient of the neural network directly in the gradient tree boosting. We empirically assess our approach on 4 popular data sets and compare against state-of-the-art algorithms. The results show that our algorithm achieves a higher accuracy while obtaining the same level of fairness, as measured using a set of different common fairness definitions.
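One plausible way to write the combined update described above is as a pseudo-residual that mixes the two gradients. The notation is an assumption introduced here for illustration: $F_{m-1}$ is the boosted predictor after $m-1$ trees, $A$ is the adversarial network, $\mathcal{L}_Y$ and $\mathcal{L}_S$ are the prediction and adversary losses, and $\lambda \ge 0$ is a fairness trade-off weight. The exact objective used by the authors may differ.

```latex
% Penalized objective: fit Y well while making S hard to recover from F(x)
\min_{F} \; \sum_{i} \mathcal{L}_Y\bigl(y_i, F(x_i)\bigr)
  \;-\; \lambda \sum_{i} \mathcal{L}_S\bigl(s_i, A(F(x_i))\bigr)

% Pseudo-residual fitted by the m-th tree: negative gradient of the objective
r_i^{(m)} =
  -\,\frac{\partial \mathcal{L}_Y\bigl(y_i, F_{m-1}(x_i)\bigr)}{\partial F_{m-1}(x_i)}
  \;+\; \lambda\,
  \frac{\partial \mathcal{L}_S\bigl(s_i, A(F_{m-1}(x_i))\bigr)}{\partial F_{m-1}(x_i)}
```

The first term is the usual gradient boosting residual for $Y$. The second term, obtained by backpropagating through $A$, pushes the predictor in the direction that increases the adversary's loss, so that $S$ becomes harder to predict from the output; with $\lambda = 0$ the update reduces to standard gradient tree boosting.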