 Ensemble Learning


BANKNOTE AUTHENTICATION USING RANDOM FOREST -- WITH SOURCE CODE -- EASY PROJECT

#artificialintelligence

In today's blog, we will see how to perform bank note authentication, that is, how to classify bank notes as fake or authentic based on numeric features such as variance, skewness, kurtosis, and entropy. This is going to be a very short blog, so let's get started without further ado. To explore more Machine Learning, Deep Learning, Computer Vision, NLP, and Flask projects, visit my blog.


GAM Changer: Editing Generalized Additive Models with Interactive Visualization

arXiv.org Artificial Intelligence

Recent strides in interpretable machine learning (ML) research reveal that models exploit undesirable patterns in the data to make predictions, which potentially causes harms in deployment. However, it is unclear how we can fix these models. We present our ongoing work, GAM Changer, an open-source interactive system to help data scientists and domain experts easily and responsibly edit their Generalized Additive Models (GAMs). With novel visualization techniques, our tool puts interpretability into action -- empowering human users to analyze, validate, and align model behaviors with their knowledge and values. Built using modern web technologies, our tool runs locally in users' computational notebooks or web browsers without requiring extra compute resources, lowering the barrier to creating more responsible ML models. GAM Changer is available at https://interpret.ml/gam-changer.


Local Adaptivity of Gradient Boosting in Histogram Transform Ensemble Learning

arXiv.org Machine Learning

In this paper, we propose a gradient boosting algorithm called adaptive boosting histogram transform (ABHT) for regression to illustrate the local adaptivity of gradient boosting algorithms in histogram transform ensemble learning. From the theoretical perspective, when the target function lies in a locally Hölder continuous space, we show that our ABHT can filter out the regions with different orders of smoothness. Consequently, we are able to prove that the upper bound of the convergence rates of ABHT is strictly smaller than the lower bound of parallel ensemble histogram transform (PEHT). In the experiments, both synthetic and real-world data experiments empirically validate the theoretical results, which demonstrates the advantageous performance and local adaptivity of our ABHT.


VisRuler: Visual Analytics for Extracting Decision Rules from Bagged and Boosted Decision Trees

arXiv.org Machine Learning

Bagging and boosting are two popular ensemble methods in machine learning (ML) that produce many individual decision trees. Due to the inherent ensemble characteristic of these methods, they typically outperform single decision trees or other ML models in predictive performance. However, numerous decision paths are generated for each decision tree, increasing the overall complexity of the model and hindering its use in domains that require trustworthy and explainable decisions, such as finance, social care, and health care. Thus, the interpretability of bagging and boosting algorithms, such as random forests and adaptive boosting, reduces as the number of decisions rises. In this paper, we propose a visual analytics tool that aims to assist users in extracting decisions from such ML models via a thorough visual inspection workflow that includes selecting a set of robust and diverse models (originating from different ensemble learning algorithms), choosing important features according to their global contribution, and deciding which decisions are essential for global explanation (or locally, for specific cases). The outcome is a final decision based on the class agreement of several models and the explored manual decisions exported by users. Finally, we evaluate the applicability and effectiveness of VisRuler via a use case, a usage scenario, and a user study.
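
The "class agreement of several models" mentioned above can be pictured with a simple majority vote. The sketch below is only an illustration under that assumption: the abstract does not say how VisRuler aggregates model outputs, and the function name `majority_vote` and the example labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return (label, agreement) for one sample.

    `predictions` holds the class label proposed by each model.
    `agreement` is the fraction of models that voted for the winning label.
    Ties are resolved in favour of the label that appears first.
    """
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)


# Hypothetical labels from three models for a single sample.
label, agreement = majority_vote(["authentic", "fake", "authentic"])
print(label, round(agreement, 2))
```

A low agreement value would flag a case where the models disagree and a closer manual look at the individual decision rules may be worthwhile.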


Improve Random Forest with Linear Models

#artificialintelligence

Random Forest is probably considered by most to be the silver bullet for supervised prediction tasks. Any data scientist involved in standard machine learning applications is certainly used to fitting and benchmarking a Random Forest. Random Forest is a well-known algorithm in the literature and has been shown to reach satisfactory results in both regression and classification contexts. It can learn complex data relationships with little effort. Many efficient open-source implementations are available to all of us (the one provided by scikit-learn is surely the most famous).


How to use the Lazy Predict library to select the best machine learning model

#artificialintelligence

Machine learning is a hot topic in data science, but few people understand the fundamental concepts behind it. You may be fascinated by how people get high-paying jobs because they know how to execute machine learning, only to be quickly intimidated by the sophisticated theorems and mathematics behind it. While I am no machine learning expert, I hope to provide some basics about machine learning and how you can potentially use Python to perform machine learning. With all the machine learning tools available at your fingertips, it is often tempting to jump straight into solving a data-related problem by running your favourite algorithm. However, this is usually a bad way to begin your analysis.


An XGBoost-Based Forecasting Framework for Product Cannibalization

arXiv.org Artificial Intelligence

One of the major challenges in making such forecasts is taking the effect of product cannibalization into account. Product cannibalization occurs when demand for a certain product within the portfolio increases, possibly due to the launch of a new product. This consequently reduces the sales of older products. This interaction between different data samples means that the total demand for all products remains stable but with large variations in the demand for individual products within the portfolio. Machine learning allows us to model complex dynamics and capture a large number of input variables, in contrast to traditional statistical models. Generally, machine learning models try to optimize the cost function by using input features to the model and updating the model parameters accordingly. However, in product cannibalization the demand for a given product is impacted by the demand for a different product that is not part of the input feature set. In this work, the proposed framework aims to make accurate sales forecasts for old products that are cannibalized due to the launch of newer products.


Prediction Model for Mortality Analysis of Pregnant Women Affected With COVID-19

arXiv.org Artificial Intelligence

The COVID-19 pandemic is an ongoing global pandemic that has caused unprecedented disruptions in the public health sector and the global economy. The virus SARS-CoV-2 is responsible for the rapid transmission of coronavirus disease. Due to its contagious nature, the virus can easily infect an unprotected and exposed individual, causing mild to severe symptoms. The effect of the virus on pregnant mothers and neonates is now a global concern among civilians and public health workers. This paper aims to develop a predictive model to estimate the possibility of death for a COVID-diagnosed mother based on documented symptoms: dyspnea, cough, rhinorrhea, arthralgia, and the diagnosis of pneumonia. The machine learning models used in our study are support vector machine, decision tree, random forest, gradient boosting, and artificial neural network. The models have provided impressive results and can accurately predict the mortality of pregnant mothers with a given input. The precision rate for 3 models (ANN, Gradient Boosting, Random Forest) is 100%. The highest accuracy score (Gradient Boosting, ANN) is 95%, the highest recall (Support Vector Machine) is 92.75%, and the highest F1 score (Gradient Boosting, ANN) is 94.66%. Due to the accuracy of the model, pregnant mothers can expect immediate medical treatment based on their possibility of death due to the virus. The model can be utilized by health workers globally to list emergency patients, which can ultimately reduce the death rate of COVID-19 diagnosed pregnant mothers.
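
To make the reported metrics easier to read side by side, here is a minimal sketch of how accuracy, precision, recall, and F1 follow from confusion-matrix counts. The counts are hypothetical and are not taken from the paper; the function name `classification_metrics` is likewise an assumption for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives and true negatives. Undefined ratios fall back to 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts (NOT from the paper): no false positives, two misses.
metrics = classification_metrics(tp=8, fp=0, fn=2, tn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts, precision is 100% because there are no false positives, while recall is lower because two positive cases are missed. This is why a perfect precision figure should be read together with recall and F1.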


There is no Double-Descent in Random Forests

arXiv.org Machine Learning

Random Forests (RFs) are among the state-of-the-art in machine learning and offer excellent performance with nearly zero parameter tuning. Remarkably, RFs seem to be impervious to overfitting even though their basic building blocks are well-known to overfit. Recently, a broadly received study argued that a RF exhibits a so-called double-descent curve: First, the model overfits the data in a u-shaped curve and then, once a certain model complexity is reached, it suddenly improves its performance again. In this paper, we challenge the notion that model capacity is the correct tool to explain the success of RF and argue that the algorithm which trains the model plays a more important role than previously thought. We show that a RF does not exhibit a double-descent curve but rather has a single descent. Hence, it does not overfit in the classic sense. We further present a RF variation that also does not overfit although its decision boundary approximates that of an overfitted DT. Similarly, we show that a DT which approximates the decision boundary of a RF will still overfit. Last, we study the diversity of an ensemble as a tool to estimate its performance. To do so, we introduce Negative Correlation Forest (NCForest), which allows for precise control over the diversity in the ensemble. We show that the diversity and the bias indeed have a crucial impact on the performance of the RF. Having too low diversity collapses the performance of the RF into that of a single tree, whereas having too much diversity means that most trees do not produce correct outputs anymore. However, in between these two extremes we find a large range of different trade-offs with all roughly equal performance. Hence, the specific trade-off between bias and diversity does not matter as long as the algorithm reaches this good trade-off regime.
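
One classical way to make the role of diversity concrete, for squared-error regression, is the ambiguity decomposition: the squared error of an averaged ensemble equals the average squared error of its members minus their average squared deviation from the ensemble mean. The sketch below checks that general identity on made-up numbers. It is not NCForest, and the abstract does not state which decomposition the paper uses; the function name and values are hypothetical.

```python
def ambiguity_decomposition(predictions, target):
    """Return (ensemble_error, avg_member_error, diversity) for one sample.

    For an averaging ensemble under squared error the identity
        ensemble_error == avg_member_error - diversity
    holds exactly (up to floating-point rounding).
    """
    n = len(predictions)
    ensemble = sum(predictions) / n
    ensemble_error = (ensemble - target) ** 2
    avg_member_error = sum((p - target) ** 2 for p in predictions) / n
    diversity = sum((p - ensemble) ** 2 for p in predictions) / n
    return ensemble_error, avg_member_error, diversity


# Hypothetical predictions of three trees for a single target value.
ens_err, member_err, div = ambiguity_decomposition([1.0, 2.0, 3.0], target=2.5)
print(ens_err, member_err - div)
```

If every tree makes the same prediction, the diversity term is zero and the ensemble is no better than a single tree, which mirrors the "too low diversity" extreme described above.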


Oblique and rotation double random forest

arXiv.org Artificial Intelligence

An ensemble of decision trees is known as Random Forest. As suggested by Breiman, the strength of unstable learners and the diversity among them are the ensemble models' core strength. In this paper, we propose two approaches known as oblique and rotation double random forests. In the first approach, we propose a rotation based double random forest. In rotation based double random forests, a transformation or rotation of the feature space is generated at each node. At each node a different random feature subspace is chosen for evaluation, hence the transformation at each node is different. Different transformations result in better diversity among the base learners and hence, better generalization performance. With the double random forest as base learner, the data at each node is transformed via two different transformations, namely principal component analysis and linear discriminant analysis. In the second approach, we propose oblique double random forest. Decision trees in random forest and double random forest are univariate, and this results in the generation of axis-parallel splits, which fail to capture the geometric structure of the data. Also, the standard random forest may not grow sufficiently large decision trees, resulting in suboptimal performance. To capture the geometric properties and to grow the decision trees of sufficient depth, we propose oblique double random forest. The oblique double random forest models are multivariate decision trees. At each non-leaf node, a multisurface proximal support vector machine generates the optimal plane for better generalization performance. Also, different regularization techniques (Tikhonov regularisation and axis-parallel split regularisation) are employed for tackling the small sample size problems in the decision trees of oblique double random forest.
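
The difference between an axis-parallel (univariate) split and an oblique (multivariate) split can be shown on a toy example. This is a minimal sketch under stated assumptions: the data are made up, and the oblique hyperplane is chosen by hand, whereas the paper obtains it from a multisurface proximal support vector machine. All function names are hypothetical.

```python
def axis_parallel_rule(feature, threshold):
    """Univariate test: send x to the left child when x[feature] <= threshold."""
    return lambda x: x[feature] <= threshold


def oblique_rule(weights, bias):
    """Multivariate test: send x to the left child when w . x + bias <= 0."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x)) + bias <= 0


def is_pure_split(points, labels, rule):
    """True when both children are non-empty and each contains one class only."""
    left = {y for x, y in zip(points, labels) if rule(x)}
    right = {y for x, y in zip(points, labels) if not rule(x)}
    return len(left) == 1 and len(right) == 1


# Hypothetical toy data: class "A" lies above the diagonal x1 = x0, "B" below.
points = [(0, 1), (1, 2), (2, 3), (1, 0), (2, 1), (3, 2)]
labels = ["A", "A", "A", "B", "B", "B"]

# Try every axis-parallel threshold that occurs in the data.
axis_pure = any(
    is_pure_split(points, labels, axis_parallel_rule(f, p[f]))
    for f in range(2)
    for p in points
)

# Hand-picked hyperplane x0 - x1 = 0 (an assumption, not learned from data).
oblique_pure = is_pure_split(points, labels, oblique_rule((1, -1), 0))

print("axis-parallel split separates classes:", axis_pure)
print("oblique split separates classes:", oblique_pure)
```

On this toy data no single axis-parallel threshold separates the two classes, so a univariate tree would need several levels, while one oblique split along the diagonal is enough.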