Goto

Collaborating Authors

 Ensemble Learning


Lale: Consistent Automated Machine Learning

arXiv.org Artificial Intelligence

Automated machine learning makes it easier for data scientists to develop pipelines by searching over possible choices for hyperparameters, algorithms, and even pipeline topologies. Unfortunately, the syntax for automated machine learning tools is inconsistent with manual machine learning, with each other, and with error checks. Furthermore, few tools support advanced features such as topology search or higher-order operators. This paper introduces Lale, a library of high-level Python interfaces that simplifies and unifies automated machine learning in a consistent way.


Artificial Intelligence revolutionizes the insurance industry

#artificialintelligence

Pricing: Through predictive models (with algorithms such as random forest, linear regression, xgboost, etc.), we can provide insurance premiums in a more dynamic and precise way. More specifically, they can be personalized according to driving habits, geographic area and commute distance. To the traditional price-setting variables, a new set of variables are added to improve the profitability of the portfolio. These variables depend on the company's own needs/capacities and can range from competitors' prices to the policyholder's traffic record, driver's license age, credit score, as well as external data systems and sources. The interesting thing here is the dynamism in determining the price; the models change based on data inputted over time, then recognize patterns and adjust the rate autonomously.


Uncertainty in Gradient Boosting via Ensembles

arXiv.org Machine Learning

Gradient boosting is a powerful machine learning technique that is particularly successful for tasks containing heterogeneous features and noisy data. While gradient boosting classification models return a distribution over class labels, regressions models typically yield only point predictions. However, for many practical, high-risk applications, it is also important to be able to quantify uncertainty in the predictions to avoid costly mistakes. In this work, we examine a probabilistic ensemble-based framework for deriving uncertainty estimates in the predictions of gradient boosting classification and regression models. Crucially, the proposed approach allows the total uncertainty to be decomposed into \textit{data uncertainty}, which comes from the complexity and noise in data distribution, and \textit{knowledge uncertainty}, coming from the lack of information about a given region of the feature space. Two approaches for generating ensembles are considered: Stochastic Gradient Boosting (SGB) and Stochastic Gradient Langevin Boosting (SGLB). Notably, SGLB also enables the generation of a \emph{virtual} ensemble via only one gradient boosting model, which significantly reduces complexity. Experiments on a range of regression and classification datasets show that ensembles of gradient boosting models yield improved predictive performance, and measures of uncertainty successfully enable detection of out-of-domain inputs.


Introducing the open-source Amazon SageMaker XGBoost algorithm container

#artificialintelligence

XGBoost is a popular and efficient machine learning (ML) algorithm for regression and classification tasks on tabular datasets. It implements a technique known as gradient boosting on trees and performs remarkably well in ML competitions. Since its launch, Amazon SageMaker has supported XGBoost as a built-in managed algorithm. For more information, see Simplify machine learning with XGBoost and Amazon SageMaker. As of this writing, you can take advantage of the open-source Amazon SageMaker XGBoost container, which has improved flexibility, scalability, extensibility, and Managed Spot Training.


Do Decision Trees need Feature Scaling?

#artificialintelligence

Machine Learning algorithms have always been on the path towards evolution since its inception. Today the domain has come a long way from mathematical modelling to ensemble modelling and more. This evolution has seen more robust and SOTA models which is almost bridging the gap between potentials capabilities of human and AI. Ensemble modelling has given us one of those SOTA model XGBoost. Recently I happened to participate in a Machine Learning Hiring Challenge where the problem statement was a classification problem.


DriveML: Self-Drive Machine Learning Projects

#artificialintelligence

Implementing some of the pillars of an automated machine learning pipeline such as (i) Automated data preparation, (ii) Feature engineering, (iii) Model building in classification context that includes techniques such as (a) Regularised regression [1], (b) Logistic regression [2], (c) Random Forest [3], (d) Decision tree [4] and (e) Extreme Gradient Boosting (xgboost) [5], and finally, (iv) Model explanation (using lift chart and partial dependency plots). Also provides some additional features such as generating missing at random (MAR) variables and automated exploratory data analysis. Moreover, function exports the model results with the required plots in an HTML vignette report format that follows the best practices of the industry and the academia.


Engineering Blog - Learnings from Distributed XGBoost on Amazon SageMaker

#artificialintelligence

XGBoost is a popular Python library for gradient boosted decision trees. The implementation allows practitioners to distribute training across multiple compute instances (or workers), which is especially useful for large training sets. One tool used at Zalando for deploying production machine learning models is the managed service from Amazon called SageMaker. XGBoost is already included in SageMaker as a built-in algorithm, meaning that a prebuilt docker container is available. This container also supports distributed training, making it easy to scale training jobs across many instances.


Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation

arXiv.org Machine Learning

Automated machine learning (AutoML) can produce complex model ensembles by stacking, bagging, and boosting many individual models like trees, deep networks, and nearest neighbor estimators. While highly accurate, the resulting predictors are large, slow, and opaque as compared to their constituents. To improve the deployment of AutoML on tabular data, we propose FAST-DAD to distill arbitrarily complex ensemble predictors into individual models like boosted trees, random forests, and deep networks. At the heart of our approach is a data augmentation strategy based on Gibbs sampling from a self-attention pseudolikelihood estimator. Across 30 datasets spanning regression and binary/multiclass classification tasks, FAST-DAD distillation produces significantly better individual models than one obtains through standard training on the original data. Our individual distilled models are over 10x faster and more accurate than ensemble predictors produced by AutoML tools like H2O/AutoSklearn.


Design and Evaluation of Personalized Free Trials

arXiv.org Machine Learning

Free trial promotions, where users are given a limited time to try the product for free, are a commonly used customer acquisition strategy in the Software as a Service (SaaS) industry. We examine how trial length affect users' responsiveness, and seek to quantify the gains from personalizing the length of the free trial promotions. Our data come from a large-scale field experiment conducted by a leading SaaS firm, where new users were randomly assigned to 7, 14, or 30 days of free trial. First, we show that the 7-day trial to all consumers is the best uniform policy, with a 5.59% increase in subscriptions. Next, we develop a three-pronged framework for personalized policy design and evaluation. Using our framework, we develop seven personalized targeting policies based on linear regression, lasso, CART, random forest, XGBoost, causal tree, and causal forest, and evaluate their performances using the Inverse Propensity Score (IPS) estimator. We find that the personalized policy based on lasso performs the best, followed by the one based on XGBoost. In contrast, policies based on causal tree and causal forest perform poorly. We then link a method's effectiveness in designing policy with its ability to personalize the treatment sufficiently without over-fitting (i.e., capture spurious heterogeneity). Next, we segment consumers based on their optimal trial length and derive some substantive insights on the drivers of user behavior in this context. Finally, we show that policies designed to maximize short-run conversions also perform well on long-run outcomes such as consumer loyalty and profitability.


Learning Gradient Boosted Multi-label Classification Rules

arXiv.org Machine Learning

In multi-label classification, where the evaluation of predictions is less straightforward than in single-label classification, various meaningful, though different, loss functions have been proposed. Ideally, the learning algorithm should be customizable towards a specific choice of the performance measure. Modern implementations of boosting, most prominently gradient boosted decision trees, appear to be appealing from this point of view. However, they are mostly limited to single-label classification, and hence not amenable to multi-label losses unless these are label-wise decomposable. In this work, we develop a generalization of the gradient boosting framework to multi-output problems and propose an algorithm for learning multi-label classification rules that is able to minimize decomposable as well as non-decomposable loss functions. Using the well-known Hamming loss and subset 0/1 loss as representatives, we analyze the abilities and limitations of our approach on synthetic data and evaluate its predictive performance on multi-label benchmarks.