Ensemble Learning
Adaptive Generation Model: A New Ensemble Method
As a common approach in Machine Learning, Ensemble Methods train multiple models on a data set and obtain better results by combining them through some strategy. Stacking, a representative Ensemble Learning method, is often used in Machine Learning competitions such as Kaggle. This paper proposes a variant of the Stacking Model based on the idea of gcForest, namely the Adaptive Generation Model (AGM). Adaptive generation is performed not only in the horizontal direction, to expand the width of each layer, but also in the vertical direction, to expand the depth of the model. The base models of AGM all come from a preset pool of basic Machine Learning models. In addition, a feature augmentation method is added between layers to further improve the overall accuracy of the model. Finally, comparative experiments on 7 data sets show that the accuracy of AGM is better than that of previous models.
To Bag is to Prune
It is notoriously hard to build a bad Random Forest (RF). Concurrently, RF is perhaps the only standard ML algorithm that blatantly overfits in-sample without any consequence out-of-sample. Standard arguments cannot rationalize this paradox. I propose a new explanation: bootstrap aggregation and model perturbation as implemented by RF automatically prune a (latent) true underlying tree. More generally, there is no need to tune the stopping point of a properly randomized ensemble of greedily optimized base learners. Thus, Boosting and MARS are eligible for automatic (implicit) tuning. I empirically demonstrate the property, with simulated and real data, by reporting that these new completely overfitting ensembles yield an out-of-sample performance equivalent to that of their tuned counterparts -- or better.
An explainable XGBoost-based approach towards assessing the risk of cardiovascular disease in patients with Type 2 Diabetes Mellitus
Athanasiou, Maria, Sfrintzeri, Konstantina, Zarkogianni, Konstantia, Thanopoulou, Anastasia C., Nikita, Konstantina S.
Cardiovascular Disease (CVD) is an important cause of disability and death among individuals with Diabetes Mellitus (DM). International clinical guidelines for the management of Type 2 DM (T2DM) are founded on primary and secondary prevention and favor the evaluation of CVD-related risk factors to guide appropriate treatment initiation. CVD risk prediction models can provide valuable tools for optimizing the frequency of medical visits and performing timely preventive and therapeutic interventions against CVD events. Integrating explainability into these models can enhance human understanding of their reasoning process, maximize transparency, and build the trust needed for their adoption in clinical practice. The aim of the present study is to develop and evaluate an explainable personalized risk prediction model for fatal or non-fatal CVD incidence in individuals with T2DM. An explainable approach based on eXtreme Gradient Boosting (XGBoost) and the Tree SHAP (SHapley Additive exPlanations) method is deployed to calculate the 5-year CVD risk and to generate individual explanations of the model's decisions. Data from the 5-year follow-up of 560 patients with T2DM are used for development and evaluation. The obtained results (AUC = 71.13%) indicate the potential of the proposed approach to handle the unbalanced nature of the dataset while providing clinically meaningful insights into the ensemble model's decision process.
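The individual explanations mentioned above are Shapley values. As a rough illustration of the quantity that Tree SHAP computes efficiently for tree ensembles, the sketch below computes exact Shapley values by brute force over all feature orderings. It is not Tree SHAP and not the authors' model: `toy_risk`, its three features, and the baseline patient are hypothetical, and standing in for an "absent" feature with a single baseline value is a simplifying assumption.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every ordering in which features switch from baseline to x."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)      # start with every feature "absent"
        prev = predict(current)
        for i in order:
            current[i] = x[i]         # reveal feature i
            new = predict(current)
            phi[i] += new - prev      # marginal contribution of feature i
            prev = new
    return [p / len(orderings) for p in phi]


def toy_risk(features):
    """Hypothetical risk score with one interaction term (illustration only)."""
    age, hba1c, smoker = features
    risk = 0.004 * age + 0.03 * hba1c
    if smoker and age > 60:
        risk += 0.15
    return risk


patient = [67, 8.5, 1]     # hypothetical patient
baseline = [55, 7.0, 0]    # hypothetical reference patient
phi = shapley_values(toy_risk, patient, baseline)
print(dict(zip(["age", "hba1c", "smoker"], phi)))
```

The attributions sum to the difference between the patient's score and the baseline score, and the interaction term is shared equally between `age` and `smoker`. Brute force costs n! model evaluations, which is why a tree-specific algorithm is needed in practice.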
Artificial Intelligence Helps Cut Down on MRI No-shows
Figure: Weekly outpatient MRI appointment no-show rates for 1 year before (19.3%) and 6 months after (15.9%) implementation of intervention measures in March 2019, as guided by XGBoost prediction model.
September 10, 2020 -- According to ARRS' American Journal of Roentgenology (AJR), artificial intelligence (AI) predictive analytics performed moderately well in solving complex multifactorial operational problems -- outpatient MRI appointment no-shows, especially -- using a modest amount of data and basic feature engineering. "Such data may be readily retrievable from frontline information technology systems commonly used in most hospital radiology departments, and they can be readily incorporated into routine workflow practice to improve the efficiency and quality of health care delivery," wrote lead author Le Roy Chong of Singapore's Changi General Hospital. To train and validate their model, Chong and colleagues extracted records of 32,957 outpatient MRI appointments scheduled between January 2016 and December 2018 from their institution's radiology information system, while acquiring a further holdout test set of 1,080 records from January 2019. Overall, the no-show rate was 17.4%.
That looks interesting! Personalizing Communication and Segmentation with Random Forest Node Embeddings
Wang, Weiwei, Eberhardt, Wiebke, Bromuri, Stefano
Communicating effectively with customers is a challenge for many marketers, but especially in a context that is both pivotal to individual long-term financial well-being and difficult to understand: pensions. Around the world, participants are reluctant to consider their pension in advance, which leads to a lack of preparation for retirement [1], [2]. To encourage participants to obtain information on their expected pension benefits, personalizing the pension providers' email communication is a first and crucial step. We describe a machine learning approach to model email newsletters to fit participants' interests. The data for the modeling and analysis is collected from newsletters sent by a large Dutch pension provider and is divided into two parts. The first part comprises 2,228,000 customers, whereas the second part comprises the data of a pilot study, which took place in July 2018 with 465,711 participants. In both cases, our algorithm extracts features from continuous and categorical data using random forests, and then calculates node embeddings of the decision boundaries of the random forest. We illustrate the algorithm's effectiveness for the classification task, and how it can be used to perform data mining tasks. To confirm that the result is valid for more than one data set, we also illustrate the properties of our algorithm on benchmark data sets concerning churn. In the data sets considered, the proposed modeling demonstrates competitive performance with respect to other state-of-the-art approaches based on random forests, achieving the best Area Under the Curve (AUC) on the pension data set (0.948). For the descriptive part, the algorithm can identify customer segments that marketing departments can use to better target their communication towards their customers.
Random boosting and random^2 forests -- A random tree depth injection approach
Krabel, Tobias Markus, Tran, Thi Ngoc Tien, Groll, Andreas, Horn, Daniel, Jentsch, Carsten
The induction of additional randomness in parallel and sequential ensemble methods has proven to be worthwhile in many aspects. In this manuscript, we propose and examine a novel random tree depth injection approach suitable for sequential and parallel tree-based approaches including Boosting and Random Forests. The resulting methods are called Random Boost and Random^2 Forest. Both approaches serve as valuable extensions to the existing literature on the gradient boosting framework and random forests. A Monte Carlo simulation, in which tree-shaped data sets with different numbers of final partitions are built, suggests that there are several scenarios where Random Boost and Random^2 Forest can improve the prediction performance of conventional hierarchical boosting and random forest approaches. The new algorithms appear to be especially successful in cases where there are merely a few high-order interactions in the generated data. In addition, our simulations suggest that our random tree depth injection approach can improve computation time by up to 40%, while at the same time the performance losses in terms of prediction accuracy turn out to be minor or even negligible in most cases.
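To make the idea of random tree depth injection concrete, here is a minimal sketch of squared-error gradient boosting on one feature in which each tree's depth is drawn at random. It is an illustration, not the authors' implementation: drawing the depth uniformly from 1..`max_depth` is an assumption the abstract does not spell out, and `fit_tree`, `random_boost`, and all parameter values are hypothetical.

```python
import math
import random


def fit_tree(xs, ys, depth):
    """Greedy 1-D regression tree: a float leaf or (threshold, left, right)."""
    mean = sum(ys) / len(ys)
    values = sorted(set(xs))
    if depth == 0 or len(values) < 2:
        return mean
    best_sse, best_t = None, None
    for t in values[:-1]:             # both sides of each split are non-empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best_sse is None or sse < best_sse:
            best_sse, best_t = sse, t
    lo = [(x, y) for x, y in zip(xs, ys) if x <= best_t]
    hi = [(x, y) for x, y in zip(xs, ys) if x > best_t]
    return (best_t,
            fit_tree([x for x, _ in lo], [y for _, y in lo], depth - 1),
            fit_tree([x for x, _ in hi], [y for _, y in hi], depth - 1))


def predict_tree(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def random_boost(xs, ys, n_trees=50, max_depth=4, lr=0.1, seed=0):
    """Boosting where every tree gets its own randomly drawn depth."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    trees, depths = [], []
    for _ in range(n_trees):
        depth = rng.randint(1, max_depth)   # the random depth injection (assumed uniform)
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_tree(xs, residuals, depth)
        preds = [p + lr * predict_tree(tree, x) for p, x in zip(preds, xs)]
        trees.append(tree)
        depths.append(depth)
    return base, lr, trees, depths


def predict(model, x):
    base, lr, trees, _ = model
    return base + lr * sum(predict_tree(t, x) for t in trees)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]
model = random_boost(xs, ys)
base_mse = mse(ys, [model[0]] * len(xs))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
```

Because many of the drawn depths fall below `max_depth`, the average tree is shallower than in a fixed-depth ensemble, which is one plausible source of the computation-time savings the abstract reports.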
Using Machine Learning to Predict Car Accidents
Road accidents constitute a significant proportion of the number of serious injuries reported every year. Yet, it is often challenging to determine which specific conditions lead to such events, making it more difficult for local law enforcement to address the number and severity of road accidents. We all know that some characteristics of vehicles and the surroundings play a key role (engine capacity, condition of the road, etc.). However, many questions are still open. Which of these factors are the leading ones?
Artificial intelligence helps cut down on MRI no-shows
According to ARRS' American Journal of Roentgenology (AJR), artificial intelligence (AI) predictive analytics performed moderately well in solving complex multifactorial operational problems--outpatient MRI appointment no-shows, especially--using a modest amount of data and basic feature engineering. "Such data may be readily retrievable from frontline information technology systems commonly used in most hospital radiology departments, and they can be readily incorporated into routine workflow practice to improve the efficiency and quality of health care delivery," wrote lead author Le Roy Chong of Singapore's Changi General Hospital. To train and validate their model, Chong and colleagues extracted records of 32,957 outpatient MRI appointments scheduled between January 2016 and December 2018 from their institution's radiology information system, while acquiring a further holdout test set of 1,080 records from January 2019. Overall, the no-show rate was 17.4%. After evaluating various machine learning predictive models developed with widely used open-source software tools, Chong and team deployed a decision tree-based ensemble algorithm that uses a gradient boosting framework: XGBoost, version 0.80 [Tianqi Chen].
DART: Data Addition and Removal Trees
Brophy, Jonathan, Lowd, Daniel
How can we update data for a machine learning model after it has already trained on that data? In this paper, we introduce DART, a variant of random forests that supports adding and removing training data with minimal retraining. Data updates in DART are exact, meaning that adding or removing examples from a DART model yields exactly the same model as retraining from scratch on updated data. DART uses two techniques to make updates efficient. The first is to cache data statistics at each node and training data at each leaf, so that only the necessary subtrees are retrained. The second is to choose the split variable randomly at the upper levels of each tree, so that the choice is completely independent of the data and never needs to change. At the lower levels, split variables are chosen to greedily maximize a split criterion such as Gini index or mutual information. By adjusting the number of random-split levels, DART can trade off between more accurate predictions and more efficient updates. In experiments on ten real-world datasets and one synthetic dataset, we find that DART is orders of magnitude faster than retraining from scratch while sacrificing very little in terms of predictive performance.
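The caching idea can be sketched for a single tree node as follows. This is a simplified illustration under stated assumptions, not DART itself: the class `CachedSplit`, the helper names, and the toy data are hypothetical, and only per-split class counts are cached. Removing an example decrements the cached counts, and the node's subtree would need retraining only if the best-scoring split changes.

```python
from collections import Counter


def gini(counts):
    """Gini impurity of a mapping from class label to count."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


class CachedSplit:
    """Cached class counts on each side of one candidate threshold."""

    def __init__(self, feature, threshold):
        self.feature = feature
        self.threshold = threshold
        self.left = Counter()
        self.right = Counter()

    def _side(self, x):
        return self.left if x[self.feature] <= self.threshold else self.right

    def add(self, x, y):
        self._side(x)[y] += 1

    def remove(self, x, y):
        side = self._side(x)
        side[y] -= 1
        if side[y] == 0:
            del side[y]               # keep the cache identical to a fresh build

    def score(self):
        """Weighted Gini impurity after splitting (lower is better)."""
        nl, nr = sum(self.left.values()), sum(self.right.values())
        if nl + nr == 0:
            return 0.0
        return (nl * gini(self.left) + nr * gini(self.right)) / (nl + nr)


def build(data, thresholds, feature=0):
    splits = [CachedSplit(feature, t) for t in thresholds]
    for x, y in data:
        for s in splits:
            s.add(x, y)
    return splits


def best_split(splits):
    return min(splits, key=lambda s: s.score())


# Hypothetical toy data: (feature vector, class label).
data = [((1.0,), 0), ((2.0,), 0), ((3.0,), 1), ((4.0,), 1), ((2.5,), 1)]
thresholds = [1.5, 2.5, 3.5]

splits = build(data, thresholds)
before = best_split(splits).threshold

removed_x, removed_y = data[4]
for s in splits:
    s.remove(removed_x, removed_y)    # constant-time update per candidate split
after = best_split(splits).threshold

needs_retrain = before != after       # same best split: the subtree can be kept
```

In this toy case the best threshold is unchanged after the removal, so nothing below the node would need to be rebuilt, and the updated counts match those from rebuilding on the remaining four examples.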
Mitigating Bias in Machine Learning: An introduction to MLFairnessPipeline
Bias takes many different forms and impacts all groups of people. It can range from implicit to explicit and is often very difficult to detect. In the field of machine learning, bias is often subtle and hard to identify, let alone solve. Why is this a problem? Implicit bias in machine learning has very real consequences, including denial of a loan, a lengthier prison sentence, and many other harmful outcomes for underprivileged groups.