Ensemble Learning
Converting Handwritten Math Symbols into Text Using Random Forest
The Inspiration: Is it fair to say mathematicians are averse to technology? My lifelong love for math inevitably led me to an undergraduate study in mathematics. Soon after taking my first college statistics course, I realized I also had a knack for understanding and interpreting data, as well as coding in the programming language R. After graduating with a Mathematics B.Sc., I became a high school teacher. Even though I can truly say I enjoyed what I did, I still felt the need to search for a more technically challenging career path.
Gradient Boosting on Decision Trees for Mortality Prediction in Transcatheter Aortic Valve Implantation
Mamprin, Marco, Zelis, Jo M., Tonino, Pim A. L., Zinger, Svitlana, de With, Peter H. N.
Current prognostic risk scores in cardiac surgery are based on statistics and do not yet benefit from machine learning. Statistical predictors are not robust enough to correctly identify patients who would benefit from Transcatheter Aortic Valve Implantation (TAVI). This research aims to create a machine learning model to predict one-year mortality of a patient after TAVI. We adopt a modern gradient boosting on decision trees algorithm, specifically designed for categorical features. In combination with a recent technique for model interpretation, we developed a feature analysis and selection stage that enables us to identify the most important features for the prediction. We base our prediction model on the most relevant features, after interpreting and discussing the feature analysis results with clinical experts. We validated our model on 270 TAVI cases, reaching an AUC of 0.83. Our approach outperforms several widespread prognostic risk scores, such as logistic EuroSCORE II, the STS risk score and the TAVI2-score, which are broadly adopted by cardiologists worldwide.
XGBoost: Enhancement Over Gradient Boosting Machines
XGBoost was originally developed by Tianqi Chen and described in the paper titled "XGBoost: A Scalable Tree Boosting System." XGBoost itself is an enhancement to the gradient boosting algorithm created by Jerome H. Friedman in his paper titled "Greedy Function Approximation: A Gradient Boosting Machine." Both papers are well worth exploring.
Prediction of Drug Synergy by Ensemble Learning
One of the promising methods for the treatment of complex diseases such as cancer is combinational therapy. Due to the combinatorial complexity, machine learning models can be useful in this field, where significant improvements have recently been achieved in the determination of synergistic combinations. In this study, we investigate the effectiveness of different compound representations in predicting drug synergy. On a large drug combination screen dataset, we first demonstrate the use of a promising representation that has not been used for this problem before, then we propose an ensemble of representation-model combinations that outperforms each of the baseline models.
1 Scientific Background
A drug combination is called synergistic if the effect of the drug combination on the reference cell is greater than the total effect obtained from administering the individual drugs. If the opposite situation is observed, the drug combination is called antagonistic. Understanding whether a combination is antagonistic or synergistic is a resource- and time-intensive task.
CatBoostLSS -- An extension of CatBoost to probabilistic forecasting
We propose a new framework of CatBoost that predicts the entire conditional distribution of a univariate response variable. In particular, CatBoostLSS models all moments of a parametric distribution (i.e., mean, location, scale and shape [LSS]) instead of the conditional mean only. Choosing from a wide range of continuous, discrete and mixed discrete-continuous distributions, modelling and predicting the entire conditional distribution greatly enhances the flexibility of CatBoost, as it allows one to gain insight into the data generating process, as well as to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. We present both a simulation study and real-world examples that demonstrate the benefits of our approach.
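To make the idea of deriving prediction intervals from a predicted distribution concrete, here is a minimal sketch. It does not use CatBoostLSS or its API. It assumes, purely for illustration, that some distributional model has already produced a Gaussian mean and scale per observation; the function name `prediction_interval` and the parameter values are hypothetical.

```python
from statistics import NormalDist


def prediction_interval(mean, scale, coverage=0.9):
    """Central prediction interval of a Gaussian predictive distribution.

    Assumption: the predictive distribution is Normal(mean, scale). Other
    distribution families would need their own quantile function.
    """
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must be strictly between 0 and 1")
    dist = NormalDist(mu=mean, sigma=scale)
    alpha = 1.0 - coverage
    # Lower and upper quantiles that leave alpha / 2 in each tail.
    return dist.inv_cdf(alpha / 2.0), dist.inv_cdf(1.0 - alpha / 2.0)


# Hypothetical (mean, scale) pairs, as a distributional model might output.
predicted_params = [(10.0, 1.0), (12.5, 3.0)]
intervals = [prediction_interval(m, s) for m, s in predicted_params]
```

Because each observation gets its own scale, the second interval is wider than the first even at the same coverage level, which is the kind of information a conditional-mean model cannot provide.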
Aleatoric and Epistemic Uncertainty with Random Forests
Shaker, Mohammad Hossein, Hüllermeier, Eyke
Due to the steadily increasing relevance of machine learning for practical applications, many of which come with safety requirements, the notion of uncertainty has received increasing attention in machine learning research in the last couple of years. In particular, the idea of distinguishing between two important types of uncertainty, often referred to as aleatoric and epistemic, has recently been studied in the setting of supervised learning. In this paper, we propose to quantify these uncertainties with random forests. More specifically, we show how two general approaches for measuring the learner's aleatoric and epistemic uncertainty in a prediction can be instantiated with decision trees and random forests as learning algorithms in a classification setting. In this regard, we also compare random forests with deep neural networks, which have been used for a similar purpose.
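As an illustration of how an ensemble can separate the two kinds of uncertainty, the sketch below uses one commonly used entropy-based decomposition: total uncertainty is the entropy of the averaged class probabilities, aleatoric uncertainty is the average entropy of the individual trees, and epistemic uncertainty is the difference. Whether this matches the paper's exact formulation is an assumption; the function names and the per-tree probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def decompose_uncertainty(tree_probs):
    """Split predictive uncertainty for a single input.

    tree_probs: one class-probability vector per tree, all of equal length.
    Returns (total, aleatoric, epistemic) in bits.
    """
    n_trees = len(tree_probs)
    n_classes = len(tree_probs[0])
    # Ensemble prediction: average the per-tree probabilities class by class.
    mean_probs = [
        sum(tp[c] for tp in tree_probs) / n_trees for c in range(n_classes)
    ]
    total = entropy(mean_probs)
    # Aleatoric part: how uncertain each tree is on its own, on average.
    aleatoric = sum(entropy(tp) for tp in tree_probs) / n_trees
    # Epistemic part: the remainder, driven by disagreement between trees.
    epistemic = total - aleatoric
    return total, aleatoric, epistemic


# Trees agree that the classes are equally likely: uncertainty is aleatoric.
agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Trees are individually confident but contradict each other: epistemic.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

Both cases have the same total uncertainty of one bit, but the split differs: in the first it is entirely aleatoric, in the second entirely epistemic.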
Using Gradient Boosting for Time Series prediction tasks
Time series prediction problems are common in the retail domain. Companies like Walmart and Target need to keep track of how much product should be shipped from distribution centres to stores. Even a small improvement in such a demand forecasting system can save a lot of money in terms of workforce management, inventory cost and out-of-stock losses. While there are many techniques for this problem, such as ARIMA, Prophet, and LSTMs, we can also treat it as a regression problem and use trees to solve it. In this post, we will try to solve the time series problem using XGBoost.
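Treating a time series as a regression problem usually means turning past values into features. The sketch below shows one simple way to do that with lag features; it is not taken from the post, and the helper name `make_lag_features` and the sales figures are made up for illustration.

```python
def make_lag_features(series, n_lags):
    """Build a supervised dataset from a univariate series.

    Each row of X holds the n_lags values that precede the target in y.
    """
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(list(series[t - n_lags:t]))  # the window just before time t
        y.append(series[t])                   # the value to predict at time t
    return X, y


weekly_sales = [12, 15, 14, 18, 21, 19, 24]  # made-up demand figures
X, y = make_lag_features(weekly_sales, n_lags=3)
# The first training row is X[0] == [12, 15, 14] with target y[0] == 18.
```

The resulting `X` and `y` can be passed to any tree-based regressor, XGBoost included. When evaluating, split the rows chronologically rather than at random, so that the model is never trained on values that come after the ones it is tested on.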
Large Random Forests: Optimisation for Rapid Evaluation
Gossen, Frederik, Steffen, Bernhard
Random Forests are one of the most popular classifiers in machine learning. The larger they are, the more precise is the outcome of their predictions. However, this comes at a cost: their running time for classification grows linearly with the number of trees, i.e. the size of the forest. In this paper, we propose a method to aggregate large Random Forests into a single, semantically equivalent decision diagram. Our experiments on various popular datasets show speed-ups of several orders of magnitude, while, at the same time, also significantly reducing the size of the required data structure.
XGBoost: An Intuitive Explanation
We all know how XGBoost dominates Kaggle competitions thanks to its performance and speed. This blog is about understanding how XGBoost works by trying to explain the research paper. It is not about how to code or implement XGBoost or how to tune its hyperparameters. XGBoost stands for eXtreme Gradient Boosting. The post also explains bagging (bootstrap aggregating) and boosting (Adaptive Boosting).
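For intuition about the boosting idea that XGBoost builds on, here is a minimal sketch of plain gradient boosting for squared error, using one-split decision stumps on a single feature. It is not XGBoost, which among other things adds regularisation and second-order gradient information, and it is not Adaptive Boosting either. All function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    candidates = sorted(set(xs))[:-1]  # splitting at the max leaves no right side
    if not candidates:
        raise ValueError("need at least two distinct feature values")
    best = None
    for threshold in candidates:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit stumps sequentially, each one to the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_val, right_val = fit_stump(xs, residuals)
        stumps.append((threshold, left_val, right_val))
        preds = [
            p + learning_rate * (left_val if x <= threshold else right_val)
            for x, p in zip(xs, preds)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the shrunken contributions of all stumps on top of the base value."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left_val if x <= threshold else right_val
        for threshold, left_val, right_val in stumps
    )


xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 5, 5, 5]  # toy step function
model = boost(xs, ys, n_rounds=100, learning_rate=0.1)
```

Each round corrects a fraction of the remaining error, which is the key contrast with bagging, where trees are trained independently on bootstrap samples and then averaged.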
Machine Learning Transition Temperatures from 2D Structure
A priori knowledge of melting and boiling could expedite the discovery of pharmaceutical, energetic, and energy harvesting materials. The tools of data science are becoming increasingly important for exploring chemical datasets and predicting material properties. A fundamental part of data-driven modeling is molecular featurization. Herein, we propose a molecular representation with group-constitutive and geometrical descriptors that map to enthalpy and entropy, two thermodynamic quantities that drive phase transitions. The descriptors are inspired by the linear regression-based quantitative structure-property relationship of Yalkowsky and coworkers known as the Unified Physicochemical Property Estimation Relationships (UPPER).