 Ensemble Learning


XGBoost, GPUs and Scikit-Learn. YES! – Towards Data Science

#artificialintelligence

During my Machine Learning studies I developed a taste for fast Machine Learning pipelines. Since Python provides coding versatility, it is an obvious choice for this endeavor. Scikit-Learn is an excellent framework for working with almost any type of algorithm, since most Machine Learning libraries provide an interface to it. One popular example is xgboost. Although the interface exists, it lacks a lot of functionality, e.g.


Temporal Stability in Predictive Process Monitoring

arXiv.org Machine Learning

Abstract Predictive business process monitoring is concerned with the analysis of events produced during the execution of a business process in order to predict as early as possible the final outcome of an ongoing case. Traditionally, predictive process monitoring methods are optimized with respect to accuracy. However, in environments where users make decisions and take actions in response to the predictions they receive, it is equally important to optimize the stability of the successive predictions made for each case. To this end, this paper defines a notion of temporal stability for predictive process monitoring and evaluates existing methods with respect to both temporal stability and accuracy. We find that methods based on XGBoost and LSTM neural networks exhibit the highest temporal stability. We then show that temporal stability can be enhanced by hyperparameter-optimizing random forests and XGBoost classifiers with respect to inter-run stability. Finally, we show that time series smoothing techniques can further enhance temporal stability at the expense of slightly lower accuracy.

Keywords Predictive Monitoring · Early Sequence Classification · Stability

1 Introduction

Modern organizations generally execute their business processes on top of process-aware information systems, such as Enterprise Resource Planning (ERP) systems, Customer Relationship Management (CRM) systems, and Business Process Management Systems (BPMS), among others [8]. These systems record a range of events that occur during the execution of the processes they support, including events signaling the creation and completion of business process instances (herein called cases) and the start and completion of activities within each case. Event records produced by process-aware information systems can be extracted and pre-processed to produce business process event logs [1].
A business process event log consists of a set of traces, each trace consisting of the sequence of event records produced by one case. Each event record has a number of attributes. Three of these attributes are present in every event record, namely the event class (a.k.a. In other words, every event represents the occurrence of an activity at a particular point in time and in the context of a given case.
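
The abstract's two ideas, scoring how stable the successive predictions for one case are and smoothing those predictions over time, can be illustrated with a small sketch. The stability measure used here (one minus the mean absolute change between consecutive scores) and the choice of exponential smoothing are assumptions made for illustration; they are not necessarily the paper's exact definitions, and the function names are hypothetical.

```python
from typing import List


def temporal_stability(scores: List[float]) -> float:
    """Assumed measure: 1 minus the mean absolute change between
    consecutive prediction scores (each in [0, 1]) of a single case."""
    if len(scores) < 2:
        return 1.0  # a single prediction cannot fluctuate
    changes = [abs(b - a) for a, b in zip(scores, scores[1:])]
    return 1.0 - sum(changes) / len(changes)


def exponential_smoothing(scores: List[float], alpha: float) -> List[float]:
    """Blend each new score with the previous smoothed score.
    alpha = 1 keeps the raw scores; smaller alpha smooths more."""
    smoothed: List[float] = []
    for score in scores:
        if not smoothed:
            smoothed.append(score)
        else:
            smoothed.append(alpha * score + (1 - alpha) * smoothed[-1])
    return smoothed


# Successive outcome probabilities predicted as one case unfolds (made-up values).
raw_scores = [0.2, 0.8, 0.3, 0.9]
smoothed_scores = exponential_smoothing(raw_scores, alpha=0.5)

print(temporal_stability(raw_scores))
print(temporal_stability(smoothed_scores))
```

Smoothing raises the stability score because each prediction is pulled toward its predecessor; the price is that the smoothed score reacts more slowly to genuinely new evidence, which is consistent with the accuracy trade-off the abstract mentions.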


Tree Boosting With XGBoost -- Why Does XGBoost Win "Every" Machine Learning Competition?

@machinelearnbot

Tree boosting has empirically proven to be efficient for predictive mining in both classification and regression. For many years, MART (multiple additive regression trees) was the tree boosting method of choice. But starting in 2015, a new algorithm that is the first to try and nearly always wins surged to the surface: XGBoost. This algorithm re-implements tree boosting and gained popularity by winning Kaggle and other data science competitions. The paper first introduces the supervised learning task and discusses model selection techniques.


Who wins the Miss Contest for Imputation Methods? Our Vote for Miss BooPF

arXiv.org Machine Learning

Missing data is an expected issue when large amounts of data are collected, and several imputation techniques have been proposed to tackle this problem. Besides classical approaches such as MICE, the application of Machine Learning techniques is tempting. Here, the recently proposed missForest imputation method has shown high imputation accuracy under the Missing (Completely) at Random scheme with various missing rates. At its core, it is based on a random forest for classification and regression, respectively. In this paper we study whether this approach can be enhanced further by other methods such as the stochastic gradient tree boosting method, the C5.0 algorithm or modified random forest procedures. In particular, other resampling strategies within the random forest protocol are suggested. In an extensive simulation study, we analyze their performances for continuous, categorical as well as mixed-type data. Therein, MissBooPF, a combination of the stochastic gradient tree boosting method together with the parametrically bootstrapped random forest method, appeared to be promising. Finally, an empirical analysis focusing on credit information and Facebook data is conducted.


ForestHash: Semantic Hashing With Shallow Random Forests and Tiny Convolutional Networks

arXiv.org Machine Learning

Hash codes are efficient data representations for coping with the ever growing amounts of data. In this paper, we introduce a random forest semantic hashing scheme that embeds tiny convolutional neural networks (CNN) into shallow random forests, with near-optimal information-theoretic code aggregation among trees. We start with a simple hashing scheme, where random trees in a forest act as hashing functions by setting '1' for the visited tree leaf, and '0' for the rest. We show that traditional random forests fail to generate hashes that preserve the underlying similarity between the trees, rendering the random forests approach to hashing challenging. To address this, we propose to first randomly group arriving classes at each tree split node into two groups, obtaining a significantly simplified two-class classification problem, which can be handled using a light-weight CNN weak learner. Such a random class grouping scheme enables code uniqueness by enforcing each class to share its code with different classes in different trees. A non-conventional low-rank loss is further adopted for the CNN weak learners to encourage code consistency by minimizing intra-class variations and maximizing inter-class distance for the two random class groups. Finally, we introduce an information-theoretic approach for aggregating codes of individual trees into a single hash code, producing a near-optimal unique hash for each class. The proposed approach significantly outperforms state-of-the-art hashing methods for image retrieval tasks on large-scale public datasets, and performs at the level of other state-of-the-art image classification techniques while utilizing a more compact, efficient and scalable representation. This work proposes a principled and robust procedure to train and deploy in parallel an ensemble of light-weight CNNs, instead of simply going deeper.
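
The "simple hashing scheme" that the abstract starts from (mark the visited leaf with 1 and every other leaf with 0) can be sketched as follows. This is only an illustration under stated assumptions: each tree is modeled as a hypothetical function that returns a leaf index, the toy trees are hand-written threshold tests rather than learned CNN splits, and the per-tree codes are simply concatenated instead of being aggregated with the paper's information-theoretic method.

```python
from typing import Callable, List, Sequence

# Assumption: a "tree" is any function mapping a sample to the index of the leaf it reaches.
Tree = Callable[[Sequence[float]], int]


def tree_code(tree: Tree, n_leaves: int, sample: Sequence[float]) -> List[int]:
    """One-hot code for a single tree: 1 at the visited leaf, 0 elsewhere."""
    visited = tree(sample)
    return [1 if leaf == visited else 0 for leaf in range(n_leaves)]


def forest_code(trees: List[Tree], n_leaves: int, sample: Sequence[float]) -> List[int]:
    """Concatenate the per-tree codes (plain concatenation, not the paper's aggregation)."""
    code: List[int] = []
    for tree in trees:
        code.extend(tree_code(tree, n_leaves, sample))
    return code


def hamming_distance(a: List[int], b: List[int]) -> int:
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


# Two toy depth-1 trees with two leaves each (hypothetical split rules).
toy_trees: List[Tree] = [
    lambda s: 0 if s[0] <= 0.5 else 1,
    lambda s: 0 if s[1] <= 0.5 else 1,
]

code_a = forest_code(toy_trees, 2, (0.2, 0.9))
code_b = forest_code(toy_trees, 2, (0.3, 0.8))
code_c = forest_code(toy_trees, 2, (0.9, 0.1))

print(code_a, hamming_distance(code_a, code_b), hamming_distance(code_a, code_c))
```

Samples that reach the same leaves in every tree receive identical codes, and the Hamming distance grows by two for every tree in which they part ways.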


Which algorithm takes the crown: Light GBM vs XGBOOST?

@machinelearnbot

If you are an active member of the Machine Learning community, you must be aware of Boosting Machines and their capabilities. The development of Boosting Machines started from ADABOOST to today's favourite XGBOOST. XGBOOST has become a de-facto algorithm for winning competitions at Analytics Vidhya and Kaggle, simply because it is extremely powerful. But given lots and lots of data, even XGBOOST takes a long time to train. Many of you might not be familiar with the Light Gradient Boosting, but you will be after reading this article. The most natural question that will come to your mind is – Why another boosting machine algorithm?


Tree-Structured Boosting: Connections Between Gradient Boosted Stumps and Full Decision Trees

arXiv.org Machine Learning

Classification And Regression Tree (CART) analysis (Breiman et al. [1984]) is a well-established statistical learning technique, which has been adopted by numerous other fields for its model interpretability, scalability to large data sets, and connection to rule-based decision making (Loh [2014]). CART builds a model by recursively partitioning the instance space, labeling each partition with either a predicted category (in the case of classification) or a real value (in the case of regression). Despite their widespread use, CART models often have lower predictive performance than other statistical learning models, such as kernel methods and ensemble techniques (Caruana and Niculescu-Mizil [2006]). Among the latter, boosting methods were developed as a means to train an ensemble of weak learners (often CART models) iteratively into a high-performance predictive model, albeit with a loss of model interpretability. In particular, gradient boosting methods (Friedman [2001]) focus on iteratively optimizing an ensemble's prediction to increasingly match the labeled training data. Historically these two categories of approaches, CART and gradient boosting, have been studied separately, connected primarily through CART models being used as the weak learners in boosting. This paper investigates a deeper and surprising connection between full interaction models like CART and additive models like gradient boosting, showing that the resulting models exist upon a spectrum. In particular, this paper includes the following contributions:
- We introduce tree-structured boosting (TSB) as a new mechanism for creating a hierarchical ensemble model that recursively partitions the instance space, forming a perfect binary tree of weak learners. Each path from the root node to a leaf represents the outcome of a gradient boosted stumps (GBS) ensemble for a particular partition of the instance space.
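
For readers unfamiliar with gradient boosted stumps, the additive end of the spectrum described above can be sketched in a few lines. This is a minimal illustration of ordinary gradient boosting with depth-1 trees on a single feature under squared loss, not the paper's tree-structured boosting; all function names and the toy data are hypothetical.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]        # (threshold, left value, right value)
Model = Tuple[float, float, List[Stump]]  # (base prediction, learning rate, stumps)


def fit_stump(x: List[float], residuals: List[float]) -> Stump:
    """Pick the single split on x that minimizes squared error on the residuals.
    Assumes x contains at least two distinct values."""
    best, best_err = None, float("inf")
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if err < best_err:
            best, best_err = (thr, lmean, rmean), err
    return best


def predict_stump(stump: Stump, xi: float) -> float:
    thr, left_value, right_value = stump
    return left_value if xi <= thr else right_value


def fit_gbs(x: List[float], y: List[float], n_rounds: int = 20,
            learning_rate: float = 0.5) -> Model:
    """Each round fits a stump to the current residuals, which are the
    negative gradient of the squared loss, and adds a damped copy of it."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, xi) for p, xi in zip(preds, x)]
    return base, learning_rate, stumps


def predict_gbs(model: Model, xi: float) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


# Toy step-shaped data (made up for illustration).
x_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_train = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
gbs_model = fit_gbs(x_train, y_train)
print(predict_gbs(gbs_model, 2.0), predict_gbs(gbs_model, 5.0))
```

The prediction is a plain sum of stump outputs, so the model is additive: no stump depends on which side of an earlier split a sample fell. A CART tree, by contrast, fits each new split only on the samples routed to that node.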


A Simple XGBoost Tutorial Using the Iris Dataset

@machinelearnbot

I had the opportunity to start using the xgboost machine learning algorithm; it is fast and shows good results. Here I will be doing multiclass prediction with the iris dataset from scikit-learn. In order to work with the data, I need to install various scientific libraries for Python. The best way I have found is to use Anaconda. It installs all the libraries and helps to install new ones.


ABC random forests for Bayesian parameter inference

arXiv.org Machine Learning

This preprint has been reviewed and recommended by Peer Community In Evolutionary Biology (http://dx.doi.org/10.24072/pci.evolbiol.100036). Approximate Bayesian computation (ABC) has grown into a standard methodology that manages Bayesian inference for models associated with intractable likelihood functions. Most ABC implementations require the preliminary selection of a vector of informative statistics summarizing raw data. Furthermore, in almost all existing implementations, the tolerance level that separates acceptance from rejection of simulated parameter values needs to be calibrated. We propose to conduct likelihood-free Bayesian inferences about parameters with no prior selection of the relevant components of the summary statistics and bypassing the derivation of the associated tolerance level. The approach relies on the random forest methodology of Breiman (2001) applied in a (nonparametric) regression setting. We advocate the derivation of a new random forest for each component of the parameter vector of interest. When compared with earlier ABC solutions, this method offers significant gains in terms of robustness to the choice of the summary statistics, does not depend on any type of tolerance level, and is a good trade-off in terms of quality of point estimator precision and credible interval estimations for a given computing time. We illustrate the performance of our methodological proposal and compare it with earlier ABC methods on a Normal toy example and a population genetics example dealing with human population evolution. All methods designed here have been incorporated in the R package abcrf (version 1.7) available on CRAN.


4 Steps to Machine Learning with Pentaho

#artificialintelligence

The power of Pentaho Data Integration (PDI) for data access, blending and governance has been demonstrated and documented numerous times. However, perhaps less well known is how PDI as a platform, with all its data munging[1] power, is ideally suited to orchestrate and automate up to three stages of the CRISP-DM[2] life-cycle for the data science practitioner: generic data preparation/feature engineering, predictive modeling, and model deployment. By "generic data preparation" we are referring to the process of connecting to (potentially) multiple heterogeneous data sources and then joining, blending, cleaning, filtering, deriving and denormalizing data so that it is ready for consumption by machine learning (ML) algorithms. Further ML-specific data transformations, such as supervised discretization, one-hot encoding, etc., can then be applied as needed in an ML tool. For the data scientist, PDI can be used to remove the drudgery of manually repeating similar data preparation processes from one dataset to the next.