Ensemble Learning
FARF: A Fair and Adaptive Random Forests Classifier
Zhang, Wenbin, Bifet, Albert, Zhang, Xiangliang, Weiss, Jeremy C., Nejdl, Wolfgang
As Artificial Intelligence (AI) is used in more applications, the need to consider and mitigate biases from the learned models has followed. Most works in developing fair learning algorithms focus on the offline setting. However, in many real-world applications data comes in an online fashion and needs to be processed on the fly. Moreover, in practical application, there is a trade-off between accuracy and fairness that needs to be accounted for, but current methods often have multiple hyperparameters with non-trivial interaction to achieve fairness. In this paper, we propose a flexible ensemble algorithm for fair decision-making in the more challenging context of evolving online settings. This algorithm, called FARF (Fair and Adaptive Random Forests), is based on using online component classifiers and updating them according to the current distribution, that also accounts for fairness and a single hyperparameters that alters fairness-accuracy balance. Experiments on real-world discriminated data streams demonstrate the utility of FARF.
LightGBM in Python
A Gradient Boosting Decision tree or a GBDT is a very popular machine learning algorithm that has effective implementations like XGBoost and many optimization techniques are actually adopted from this algorithm. The efficiency and scalability of the model are not quite up to the mark when there are more features in the data. For this specific behavior, the major reason is that each feature should scan all the various data instances to make an estimate of all the possible split points which is very time-consuming and tedious. To solve this problem, The LGBM or Light Gradient Boosting Model is used. It uses two types of techniques which are gradient Based on side sampling or GOSS and Exclusive Feature bundling or EFB.
Gradient Boosted Decision Trees explained with a real-life example and some Python code
Gradient Boosting algorithms tackle one of the biggest problems in Machine Learning: bias. Decision Trees is a simple and flexible algorithm. An underfit Decision Tree has low depth, meaning it splits the dataset only a few of times in an attempt to separate the data. Because it doesn't separate the dataset into more and more distinct observations, it can't capture the true patterns in it. When it comes to tree-based algorithms Random Forests was revolutionary, because it used Bagging to reduce the overall variance of the model with an ensemble of random trees.
MOFit: A Framework to reduce Obesity using Machine learning and IoT
Garg, Satvik, Pundir, Pradyumn
From the past few years, due to advancements in technologies, the sedentary living style in urban areas is at its peak. This results in individuals getting a victim of obesity at an early age. There are various health impacts of obesity like Diabetes, Heart disease, Blood pressure problems, and many more. Machine learning from the past few years is showing its implications in all expertise like forecasting, healthcare, medical imaging, sentiment analysis, etc. In this work, we aim to provide a framework that uses machine learning algorithms namely, Random Forest, Decision Tree, XGBoost, Extra Trees, and KNN to train models that would help predict obesity levels (Classification), Bodyweight, and fat percentage levels (Regression) using various parameters. We also applied and compared various hyperparameter optimization (HPO) algorithms such as Genetic algorithm, Random Search, Grid Search, Optuna to further improve the accuracy of the models. The website framework contains various other features like making customizable Diet plans, workout plans, and a dashboard to track the progress. The framework is built using the Python Flask. Furthermore, a weighing scale using the Internet of Things (IoT) is also integrated into the framework to track calories and macronutrients from food intake.
A Framework for an Assessment of the Kernel-target Alignment in Tree Ensemble Kernel Learning
Feng, Dai, Baumgartner, Richard
Kernels ensuing from tree ensembles such as random forest (RF) or gradient boosted trees (GBT), when used for kernel learning, have been shown to be competitive to their respective tree ensembles (particularly in higher dimensional scenarios). On the other hand, it has been also shown that performance of the kernel algorithms depends on the degree of the kernel-target alignment. However, the kernel-target alignment for kernel learning based on the tree ensembles has not been investigated and filling this gap is the main goal of our work. Using the eigenanalysis of the kernel matrix, we demonstrate that for continuous targets good performance of the tree-based kernel learning is associated with strong kernel-target alignment. Moreover, we show that well performing tree ensemble based kernels are characterized by strong target aligned components that are expressed through scalar products between the eigenvectors of the kernel matrix and the target. This suggests that when tree ensemble based kernel learning is successful, relevant information for the supervised problem is concentrated near lower dimensional manifold spanned by the target aligned components. Persistence of the strong target aligned components in tree ensemble based kernels is further supported by sensitivity analysis via landmark learning. In addition to a comprehensive simulation study, we also provide experimental results from several real life data sets that are in line with the simulations.
An Overview of Boosting Methods: CatBoost, XGBoost, AdaBoost, LightBoost, Histogram-Based Gradientโฆ
In ensemble learning, it is aimed to train the model most successfully with multiple learning algorithms. In one of the ensemble learning, Bagging method, more than one model was applied to different subsamples of the same dataset in parallel. Boosting, which is another method and frequently used in practice, builds sequentially instead of parallelly and aims to train the algorithm as well as training the model. A weak algorithm trains the model, then it is re-organized according to the training results and it is made easier to learn. This modified model is then sent to the next algorithm and the second algorithm learns easier than the first one. This article contains different boosting methods that interpret this sequential method from different angles.
Look Before You Leap! Designing a Human-Centered AI System for Change Risk Assessment
Gupta, Binay, Chatterjee, Anirban, Matha, Harika, Banerjee, Kunal, Parsai, Lalitdutt, Agneeswaran, Vijay
Reducing the number of failures in a production system is one of the most challenging problems in technology driven industries, such as, the online retail industry. To address this challenge, change management has emerged as a promising sub-field in operations that manages and reviews the changes to be deployed in production in a systematic manner. However, it is practically impossible to manually review a large number of changes on a daily basis and assess the risk associated with them. This warrants the development of an automated system to assess the risk associated with a large number of changes. There are a few commercial solutions available to address this problem but those solutions lack the ability to incorporate domain knowledge and continuous feedback from domain experts into the risk assessment process. As part of this work, we aim to bridge the gap between model-driven risk assessment of change requests and the assessment of domain experts by building a continuous feedback loop into the risk assessment process. Here we present our work to build an end-to-end machine learning system along with the discussion of some of practical challenges we faced related to extreme skewness in class distribution, concept drift, estimation of the uncertainty associated with the model's prediction and the overall scalability of the system.
Development of Machine Learning Models to Predict Probabilities and Types of Stroke at Prehospital Stage: the Japan Urgent Stroke Triage Score Using Machine Learning (JUST-ML) - PubMed
In conjunction with recent advancements in machine learning (ML), such technologies have been applied in various fields owing to their high predictive performance. We tried to develop prehospital stroke scale with ML. We conducted multi-center retrospective and prospective cohort study. The training cohort had eight centers in Japan from June 2015 to March 2018, and the test cohort had 13 centers from April 2019 to March 2020. We use the three different ML algorithms (logistic regression, random forests, XGBoost) to develop models.
100% Off Coupon - Machine Learning & Deep Learning in Python & R
Learn how to solve real life problem using the Machine learning techniques Machine Learning models such as Linear Regression, Logistic Regression, KNN etc. Advanced Machine Learning models such as Decision trees, XGBoost, Random Forest, SVM etc. Understanding of basics of statistics and concepts of Machine Learning How to do basic statistical operations and run ML models in Python Indepth knowledge of data collection and data preprocessing for Machine Learning problem How to convert business problem into a Machine learning problem
Data-driven advice for interpreting local and global model predictions in bioinformatics problems
Tree-based algorithms such as random forests and gradient boosted trees continue to be among the most popular and powerful machine learning models used across multiple disciplines. The conventional wisdom of estimating the impact of a feature in tree based models is to measure the \textit{node-wise reduction of a loss function}, which (i) yields only global importance measures and (ii) is known to suffer from severe biases. Conditional feature contributions (CFCs) provide \textit{local}, case-by-case explanations of a prediction by following the decision path and attributing changes in the expected output of the model to each feature along the path. However, Lundberg et al. pointed out a potential bias of CFCs which depends on the distance from the root of a tree. The by now immensely popular alternative, SHapley Additive exPlanation (SHAP) values appear to mitigate this bias but are computationally much more expensive. Here we contribute a thorough comparison of the explanations computed by both methods on a set of 164 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. For random forests, we find extremely high similarities and correlations of both local and global SHAP values and CFC scores, leading to very similar rankings and interpretations. Analogous conclusions hold for the fidelity of using global feature importance scores as a proxy for the predictive power associated with each feature.