Goto

Collaborating Authors

 Ensemble Learning


Predictive Modeling: Picking the best model – Towards Data Science

#artificialintelligence

Whether you are working on predicting data in an office setting or just competing in a Kaggle competition, it's important to test out different models to find the best fit for the data you are working with. I recently had the opportunity to compete with some very smart colleagues in a private Kaggle competition predicting faulty water pumps in Tanzania. I ran the following models after doing some data cleaning and I'll show you the results. First, we need to take a look at the data we're working with. In this particular data set, the features were in a separate file than the labels.


Artificial Intelligence and Machine Learning to Predict and Improve Efficiency in Manufacturing Industry

arXiv.org Machine Learning

The overall equipment effectiveness (OEE) is a performance measurement metric widely used. Its calculation provides to the managers the possibility to identify the main losses that reduce the machine effectiveness and then take the necessary decisions in order to improve the situation. However, this calculation is done a-posterior which is often too late. In the present research, we implemented different Machine Learning algorithms namely; Support vector machine, Optimized Support vector Machine (using Genetic Algorithm), Random Forest, XGBoost and Deep Learning to predict the estimate OEE value. The data used to train our models was provided by an automotive cable production industry. The results show that the Deep Learning and Random Forest are more accurate and present better performance for the prediction of the overall equipment effectiveness in our case study.


Optimal Minimal Margin Maximization with Boosting

arXiv.org Machine Learning

Boosting algorithms produce a classifier by iteratively combining base hypotheses. It has been observed experimentally that the generalization error keeps improving even after achieving zero training error. One popular explanation attributes this to improvements in margins. A common goal in a long line of research, is to maximize the smallest margin using as few base hypotheses as possible, culminating with the AdaBoostV algorithm by (R{\"a}tsch and Warmuth [JMLR'04]). The AdaBoostV algorithm was later conjectured to yield an optimal trade-off between number of hypotheses trained and the minimal margin over all training points (Nie et al. [JMLR'13]). Our main contribution is a new algorithm refuting this conjecture. Furthermore, we prove a lower bound which implies that our new algorithm is optimal.


Gradient Regularized Budgeted Boosting

arXiv.org Machine Learning

As machine learning transitions increasingly towards real world applications controlling the test-time cost of algorithms becomes more and more crucial. Recent work, such as the Greedy Miser and Speedboost, incorporate test-time budget constraints into the training procedure and learn classifiers that provably stay within budget (in expectation). However, so far, these algorithms are limited to the supervised learning scenario where sufficient amounts of labeled data are available. In this paper we investigate the common scenario where labeled data is scarce but unlabeled data is available in abundance. We propose an algorithm that leverages the unlabeled data (through Laplace smoothing) and learns classifiers with budget constraints. Our model, based on gradient boosted regression trees (GBRT), is, to our knowledge, the first algorithm for semi-supervised budgeted learning.


Faster Boosting with Smaller Memory

arXiv.org Machine Learning

The two state-of-the-art implementations of boosted trees: XGBoost and LightGBM, can process large training sets extremely fast. However, this performance requires that memory size is sufficient to hold a 2-3 multiple of the training set size. This paper presents an alternative approach to implementing boosted trees. which achieves a significant speedup over XGBoost and LightGBM, especially when memory size is small. This is achieved using a combination of two techniques: early stopping and stratified sampling, which are explained and analyzed in the paper. We describe our implementation and present experimental results to support our claims.


SecureBoost: A Lossless Federated Learning Framework

arXiv.org Machine Learning

The protection of user privacy is an important concern in machine learning, as evidenced by the rolling out of the General Data Protection Regulation (GDPR) in the European Union (EU) in May 2018. The GDPR is designed to give users more control over their personal data, which motivates us to explore machine learning frameworks with data sharing without violating user privacy. To meet this goal, in this paper, we propose a novel lossless privacy-preserving tree-boosting system known as SecureBoost in the setting of federated learning. This federated-learning system allows a learning process to be jointly conducted over multiple parties with partially common user samples but different feature sets, which corresponds to a vertically partitioned virtual data set. An advantage of SecureBoost is that it provides the same level of accuracy as the non-privacy-preserving approach while at the same time, reveal no information of each private data provider. We theoretically prove that the SecureBoost framework is as accurate as other non-federated gradient tree-boosting algorithms that bring the data into one place. In addition, along with a proof of security, we discuss what would be required to make the protocols completely secure.


A XGBoost risk model via feature selection and Bayesian hyper-parameter optimization

arXiv.org Machine Learning

This paper aims to explore models based on the extreme gradient boosting (XGBoost) approach for business risk classification. Feature selection (FS) algorithms and hyper-parameter optimizations are simultaneously considered during model training. The five most commonly used FS methods including weight by Gini, weight by Chi-square, hierarchical variable clustering, weight by correlation, and weight by information are applied to alleviate the effect of redundant features. Two hyper-parameter optimization approaches, random search (RS) and Bayesian tree-structured Parzen Estimator (TPE), are applied in XGBoost. The effect of different FS and hyper-parameter optimization methods on the model performance are investigated by the Wilcoxon Signed Rank Test. The performance of XGBoost is compared to the traditionally utilized logistic regression (LR) model in terms of classification accuracy, area under the curve (AUC), recall, and F1 score obtained from the 10-fold cross validation. Results show that hierarchical clustering is the optimal FS method for LR while weight by Chi-square achieves the best performance in XG-Boost. Both TPE and RS optimization in XGBoost outperform LR significantly. TPE optimization shows a superiority over RS since it results in a significantly higher accuracy and a marginally higher AUC, recall and F1 score. Furthermore, XGBoost with TPE tuning shows a lower variability than the RS method. Finally, the ranking of feature importance based on XGBoost enhances the model interpretation. Therefore, XGBoost with Bayesian TPE hyper-parameter optimization serves as an operative while powerful approach for business risk modeling.


Gentle Introduction of XGBoost Library – Mohit Sharma – Medium

#artificialintelligence

In this article, you will discover XGBoost and get a gentle introduction to what it is, where it came from and how you can learn more. Bagging: It is an approach where you take random samples of data, build learning algorithms and take simple means to find bagging probabilities. Boosting: Boosting is similar, however, the selection of sample is made more intelligently. We subsequently give more and more weight to hard to classify observations. XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.


Machine learning for predicting thermal power consumption of the Mars Express Spacecraft

arXiv.org Machine Learning

The thermal subsystem of the Mars Express (MEX) spacecraft keeps the on-board equipment within its pre-defined operating temperatures range. To plan and optimize the scientific operations of MEX, its operators need to estimate in advance, as accurately as possible, the power consumption of the thermal subsystem. The remaining power can then be allocated for scientific purposes. We present a machine learning pipeline for efficiently constructing accurate predictive models for predicting the power of the thermal subsystem on board MEX. In particular, we employ state-of-the-art feature engineering approaches for transforming raw telemetry data, in turn used for constructing accurate models with different state-of-the-art machine learning methods. We show that the proposed pipeline considerably improve our previous (competition-winning) work in terms of time efficiency and predictive performance. Moreover, while achieving superior predictive performance, the constructed models also provide important insight into the spacecraft's behavior, allowing for further analyses and optimal planning of MEX's operation.


Gradient Boosted Feature Selection

arXiv.org Machine Learning

A feature selection algorithm should ideally satisfy four conditions: reliably extract relevant features; be able to identify non-linear feature interactions; scale linearly with the number of features and dimensions; allow the incorporation of known sparsity structure. In this work we propose a novel feature selection algorithm, Gradient Boosted Feature Selection (GBFS), which satisfies all four of these requirements. The algorithm is flexible, scalable, and surprisingly straight-forward to implement as it is based on a modification of Gradient Boosted Trees. We evaluate GBFS on several real world data sets and show that it matches or out-performs other state of the art feature selection algorithms. Yet it scales to larger data set sizes and naturally allows for domain-specific side information.