Goto

Collaborating Authors

 Ensemble Learning


Quick Introduction to Boosting Algorithms in Machine Learning

#artificialintelligence

Lots of analyst misinterpret the term'boosting' used in data science. Let me provide an interesting explanation of this term. Boosting grants power to machine learning models to improve their accuracy of prediction. Boosting algorithms are one of the most widely used algorithm in data science competitions. The winners of our last hackathons agree that they try boosting algorithm to improve accuracy of their models.


Evasion and Hardening of Tree Ensemble Classifiers

arXiv.org Machine Learning

Classifier evasion consists in finding for a given instance $x$ the nearest instance $x'$ such that the classifier predictions of $x$ and $x'$ are different. We present two novel algorithms for systematically computing evasions for tree ensembles such as boosted trees and random forests. Our first algorithm uses a Mixed Integer Linear Program solver and finds the optimal evading instance under an expressive set of constraints. Our second algorithm trades off optimality for speed by using symbolic prediction, a novel algorithm for fast finite differences on tree ensembles. On a digit recognition task, we demonstrate that both gradient boosted trees and random forests are extremely susceptible to evasions. Finally, we harden a boosted tree model without loss of predictive accuracy by augmenting the training set of each boosting round with evading instances, a technique we call adversarial boosting.


Complete Guide to Parameter Tuning in XGBoost (with codes in Python)

#artificialintelligence

XGBoost algorithm has become the ultimate weapon of many data scientist. It's a highly sophisticated algorithm, powerful enough to deal with all sorts of irregularities of data. Building a model using XGBoost is easy. But, improving the model using XGBoost is difficult (at least I struggled a lot). This algorithm uses multiple parameters.


ABC random forests for Bayesian parameter inference

#artificialintelligence

Before leaving Helsinki, we arXived [from the Air France lounge!] the paper Jean-Michel presented on Monday at ABCruise in Helsinki. This paper summarises the experiments Louis conducted over the past months to assess the great performances of a random forest regression approach to ABC parameter inference. I think the major incentives in exploiting the (still mysterious) tool of random forests [against more traditional ABC approaches like Fearnhead and Prangle (2012) on summary selection] are that (i) forests do not require a preliminary selection of the summary statistics, since an arbitrary number of summaries can be used as input for the random forest, even when including a large number of useless white noise variables; (b) there is no longer a tolerance level involved in the process, since the many trees in the random forest define a natural if rudimentary distance that corresponds to being or not being in the same leaf as the observed vector of summary statistics?(y); To the point that deriving a different forest for each univariate transform of interest is truly a minor drag in the overall computing cost of the approach. An intriguing point we uncovered through Louis' experiments is that an unusual version of the variance estimator is preferable to the standard estimator: we indeed exposed better estimation performances when using a weighted version of the out-of-bag residuals (which are computed as the differences between the simulated value of the parameter transforms and their expectation obtained by removing the random trees involving this simulated value). Another intriguing feature [to me] is that the regression weights as proposed by Meinshausen (2006) are obtained as an average of the inverse of the number of terms in the leaf of interest.


XGBoost

#artificialintelligence

XGBoost is an optimized distributed gradient boosting system designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting(also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment(Hadoop, SGE, MPI) and can solve problems beyond billions of examples. XGBoost open source project is actively developed by amazing contributors from DMLC/XGBoost community. This work was supported in part by ONR (PECASE) N000141010672, NSF IIS 1258741 and the TerraSwarm Research Center sponsored by MARCO and DARPA.


How to Rank 10% in Your First Kaggle Competition

#artificialintelligence

Kaggle is the best place for learning from other data scientists. Many companies provide data and prize money to set up data science competitions on Kaggle. Recently I had my first shot on Kaggle and ranked 98th ( 5%) among 2125 teams. Since this is my Kaggle debut, I feel quite satisfied. Because many Kaggle beginners set 10% as their first goal, here I want to share my experience in achieving that goal. Most Kagglers use Python and R. I prefer Python, but R users should have no difficulty in understanding the ideas behind tools and languages. First let's go through some facts about Kaggle competitions in case you are not very familiar with them. Different competitions have different tasks: classification, regression, recommendation, orderingโ€ฆ Training set and testing set will be open for download after the competition launches.


Context-dependent feature analysis with random forests

arXiv.org Machine Learning

In many cases, feature selection is often more complicated than identifying a single subset of input variables that would together explain the output. There may be interactions that depend on contextual information, i.e., variables that reveal to be relevant only in some specific circumstances. In this setting, the contribution of this paper is to extend the random forest variable importances framework in order (i) to identify variables whose relevance is context-dependent and (ii) to characterize as precisely as possible the effect of contextual information on these variables. The usage and the relevance of our framework for highlighting context-dependent variables is illustrated on both artificial and real datasets.


Random forests for survival analysis using maximally selected rank statistics

arXiv.org Machine Learning

The most popular approach for analyzing survival data is the Cox regression model. The Cox model may, however, be misspecified, and its proportionality assumption is not always fulfilled. An alternative approach is random forests for survival outcomes. The standard split criterion for random survival forests is the log-rank test statistics, which favors splitting variables with many possible split points. Conditional inference forests avoid this split point selection bias. However, linear rank statistics are utilized in current software for conditional inference forests to select the optimal splitting variable, which cannot detect non-linear effects in the independent variables. We therefore use maximally selected rank statistics for split point selection in random forests for survival analysis. As in conditional inference forests, p-values for association between split points and survival time are minimized. We describe several p-value approximations and the implementation of the proposed random forest approach. A simulation study demonstrates that unbiased split point selection is possible. However, there is a trade-off between unbiased split point selection and runtime. In benchmark studies of prediction performance on simulated and real datasets the new method performs better than random survival forests if informative dichotomous variables are combined with uninformative variables with more categories and better than conditional inference forests if non-linear covariate effects are included. In a runtime comparison the method proves to be computationally faster than both alternatives, if a simple p-value approximation is used.


Improving performance of random forests for a particular value of outcome by adding chosen features

#artificialintelligence

Choosing features to improve a performance of a particular algorithm is a difficult question. Currently here is PCA, which is difficult to understand (although it can be used out-of-the-box), requires centralizing and scaling of features and is not easy to interpret. In addition, it does not allows to improve prediction performance for a particular outcome (if its accuracy is lower than for others or it has a particular importance). My method enables to use features without preprocessing. Therefore a resulting prediction is easy to explain.


Understanding Gradient Boosting, Part 1 -- Data Stuff

#artificialintelligence

Though there are many possible supervised learning model types to choose from, gradient boosted models (GBMs) are almost always my first choice. In many cases, they end up outperforming other options, and even when they don't, it's rare that a properly tuned GBM is far behind the best model. At a high level, the way GBMs work is by starting with a rough prediction and then building a series of decision trees, with each tree in the series trying to correct the prediction error of the tree before it. There's more detailed descriptions of the mechanics behind the algorithm out there, but this series of posts is intended to give more of an intuitive understanding of what the algorithm does. For this series, I'll be using a synthetic 2-dimensional classification dataset generated using scikit-learn's make_classification().