Collaborating Authors


Ensemble Machine Learning in Python: Random Forest, AdaBoost


In recent years, we've seen a resurgence in AI, or artificial intelligence, and machine learning. Machine learning has led to some amazing results, like being able to analyze medical images and predict diseases on-par with human experts. Google's AlphaGo program was able to beat a world champion in the strategy game go using deep reinforcement learning. Machine learning is even being used to program self driving cars, which is going to change the automotive industry forever. Imagine a world with drastically reduced car accidents, simply by removing the element of human error.

Sarus just released DP-XGBoost


XGBoost is one of the most popular gradient boosted trees library and is featured in many winning solutions on Kaggle competitions. It's written in C and useable in many languages: Python, R, Java, Julia, or Scala. It can run on major distributed environments (Kubernetes, Apache Spark, or Dask) to handle datasets with billions of examples. XGBoost is often used to train models on sensitive data. Since it comes with no privacy guarantee, one can show that personal information may remain in the model weights.



At this point, we all know of XGBoost due to the massive success it has had in numerous Data Science competitions held on platforms like Kaggle. Along with its success, we have seen several variations such as CatBoost and LightGBM. All of these implementations are based on the Gradient Boosting algorithm developed by Friedman¹, which involves iteratively building an ensemble of weak learners (usually decision trees) where each subsequent learner is trained on the previous learner's errors. Let's take a look at some general pseudo-code for the algorithm from Elements of Statistical Learning²: However, this is not complete! A core mechanism which allows boosting to work is a shrinkage parameter that penalizes each learner at each boosting round that is commonly called the'learning rate'.

Machine Learning in Python with 5 Machine Learning Projects


This course is a perfect fit for you. This course will take you step by step into the world of Machine Learning. Machine Learning is the study of computer algorithms that automates analytical model building. It is a branch of Artificial Intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention. Machine Learning is actively being used today, perhaps in many more places than one world expects.

XGBoost -- The Undisputed GOAT!


In this article, we'll learn about XGBoost, its background, its widely accepted usage in competitions such as Kaggle's and help you build an intuitive understanding of it by diving into the foundation of this algorithm. XGBoost is an algorithm that is highly flexible, portable, and efficient which is based on a decision tree for ensemble learning for Machine Learning that uses the distributed gradient boosting framework. Machine Learning algorithms are implemented with XGBoost under the Gradient boosting framework. XGBoost is capable of solving data science problems accurately in a short duration with its parallel tree boosting which is also called Gradient Boosting Machine (GBM), Gradient Boosting Decision Trees (GBDT). It is extremely portable and cross-platform enabled such that the very same code can be run on the different major distributed environments such as Hadoop, MPI, and SGE and enables solving problems with well over billions of examples.



Boosting refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, one of the most popular is AdaBoost (Adaptive Boosting). The way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This is the technique used by AdaBoost.

Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features Artificial Intelligence

We report two essential improvements in readability assessment: 1. three novel features in advanced semantics and 2. the timely evidence that traditional ML models (e.g. Random Forest, using handcrafted features) can combine with transformers (e.g. RoBERTa) to augment model performance. First, we explore suitable transformers and traditional ML models. Then, we extract 255 handcrafted linguistic features using self-developed extraction software. Finally, we assemble those to create several hybrid models, achieving state-of-the-art (SOTA) accuracy on popular datasets in readability assessment. The use of handcrafted features help model performance on smaller datasets. Notably, our RoBERTA-RF-T1 hybrid achieves the near-perfect classification accuracy of 99%, a 20.3% increase from the previous SOTA.

Beyond Discriminant Patterns: On the Robustness of Decision Rule Ensembles Artificial Intelligence

Local decision rules are commonly understood to be more explainable, due to the local nature of the patterns involved. With numerical optimization methods such as gradient boosting, ensembles of local decision rules can gain good predictive performance on data involving global structure. Meanwhile, machine learning models are being increasingly used to solve problems in high-stake domains including healthcare and finance. Here, there is an emerging consensus regarding the need for practitioners to understand whether and how those models could perform robustly in the deployment environments, in the presence of distributional shifts. Past research on local decision rules has focused mainly on maximizing discriminant patterns, without due consideration of robustness against distributional shifts. In order to fill this gap, we propose a new method to learn and ensemble local decision rules, that are robust both in the training and deployment environments. Specifically, we propose to leverage causal knowledge by regarding the distributional shifts in subpopulations and deployment environments as the results of interventions on the underlying system. We propose two regularization terms based on causal knowledge to search for optimal and stable rules. Experiments on both synthetic and benchmark datasets show that our method is effective and robust against distributional shifts in multiple environments.

WildWood: a new Random Forest algorithm Machine Learning

This paper introduces WildWood (WW), a new ensemble method of Random Forest (RF) type [9]. The main contributions of the paper and the main advantages of WW are as follows. Firstly, we use out-of-bag samples (trees in a RF use different bootstrapped samples) very differently than what is done in standard RF [43, 7]. Indeed, WW uses these samples to compute an aggregation of the predictions of all possible subtrees of each tree in the forest, using aggregation with exponential weights [14]. This leads to much improved predictions: while only leaves contribute to the predictions of a tree in standard RF, the full tree structure contributes to predictions in WW. An illustration of this effect is given in Figure 1 on a toy binary classification example, where we can observe that subtrees aggregation leads to improved and regularized decision functions for each individual tree and for the forest. We further illustrate in Figure 2 that each tree becomes a stronger learner, and that excellent performance can be achieved even when WW uses few trees. A remarkable aspect of WW is that this improvement comes only at a small computational cost, thanks to a technique called "context tree weighting", used in lossless compression or online learning to aggregate all subtrees of a given tree [73, 72, 34, 14, 50]. Also, the predictions of WW do not rely on MCMC approximations required with Bayesian variants of RF [21, 26, 22, 66], which is a clear distinction from such methods.

Feature Importance in Gradient Boosting Trees with Cross-Validation Feature Selection Machine Learning

Gradient Boosting Machines (GBM) are among the go-to algorithms on tabular data, which produce state of the art results in many prediction tasks. Despite its popularity, the GBM framework suffers from a fundamental flaw in its base learners. Specifically, most implementations utilize decision trees that are typically biased towards categorical variables with large cardinalities. The effect of this bias was extensively studied over the years, mostly in terms of predictive performance. In this work, we extend the scope and study the effect of biased base learners on GBM feature importance (FI) measures. We show that although these implementation demonstrate highly competitive predictive performance, they still, surprisingly, suffer from bias in FI. By utilizing cross-validated (CV) unbiased base learners, we fix this flaw at a relatively low computational cost. We demonstrate the suggested framework in a variety of synthetic and real-world setups, showing a significant improvement in all GBM FI measures while maintaining relatively the same level of prediction accuracy.