 Ensemble Learning


Hurricane Forecasting: A Novel Multimodal Machine Learning Framework

arXiv.org Artificial Intelligence

This paper describes a machine learning (ML) framework for tropical cyclone intensity and track forecasting, combining multiple distinct ML techniques and utilizing diverse data sources. Our framework, which we refer to as Hurricast (HURR), is built upon the combination of distinct data processing techniques using gradient-boosted trees and novel encoder-decoder architectures, including CNN, GRU and Transformer components. We propose a deep-feature extractor methodology to mix spatial-temporal data with statistical data efficiently. Our multimodal framework unleashes the potential of making forecasts based on a wide range of data sources, including historical storm data, reanalysis atmospheric images, and operational forecasts. We evaluate our models against current operational forecasts in the North Atlantic and Eastern Pacific basins on the last years of available data. The results show that our models consistently outperform statistical-dynamical models and, although they are less accurate than the best dynamical models, our framework computes forecasts in seconds. Furthermore, the inclusion of Hurricast into an operational forecast consensus model leads to a significant improvement of 5% to 15% over NHC's official forecast, thus highlighting its complementary properties with existing approaches. In summary, our work demonstrates that combining different data sources and distinct machine learning methodologies can lead to superior tropical cyclone forecasting.


Margins are Insufficient for Explaining Gradient Boosting

arXiv.org Machine Learning

Boosting is one of the most successful ideas in machine learning, achieving great practical performance with little fine-tuning. The success of boosted classifiers is most often attributed to improvements in margins. The focus on margin explanations was pioneered in the seminal work by Schapire et al. (1998) and has culminated in the $k$'th margin generalization bound by Gao and Zhou (2013), which was recently proved to be near-tight for some data distributions (Grønlund et al. 2019). In this work, we first demonstrate that the $k$'th margin bound is inadequate in explaining the performance of state-of-the-art gradient boosters. We then explain the shortcomings of the $k$'th margin bound and prove a stronger and more refined margin-based generalization bound for boosted classifiers that indeed succeeds in explaining the performance of modern gradient boosters. Finally, we improve upon the recent generalization lower bound by Grønlund et al. (2019).
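
For readers unfamiliar with the quantity these bounds are stated in, the following is a minimal sketch of how the normalized margin of a voting classifier and its $k$'th smallest margin can be computed. The toy ensemble, the weights, and the function names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def margins(weights, predictions, labels):
    """Normalized margin y_i * sum_t(a_t * h_t(x_i)) / sum_t |a_t| per example.

    predictions[t][i] is the +1/-1 vote of weak learner t on example i.
    """
    total = sum(abs(a) for a in weights)
    result = []
    for i, y in enumerate(labels):
        score = sum(a * h[i] for a, h in zip(weights, predictions))
        result.append(y * score / total)
    return result


def kth_margin(margin_values, k):
    """k-th smallest margin over the sample (k is 1-indexed)."""
    return sorted(margin_values)[k - 1]


# Hypothetical toy ensemble: three weak learners voting on four examples.
weights = [0.5, 0.3, 0.2]
predictions = [
    [+1, +1, -1, -1],
    [+1, -1, -1, +1],
    [-1, +1, -1, -1],
]
labels = [+1, +1, -1, +1]

m = margins(weights, predictions, labels)
smallest = kth_margin(m, 1)  # the minimum margin
second = kth_margin(m, 2)    # the 2nd smallest margin
```

A negative margin means the weighted vote misclassifies that example; a margin close to 1 means the weak learners agree almost unanimously on the correct label.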


The Macroeconomy as a Random Forest

arXiv.org Machine Learning

I develop Macroeconomic Random Forest (MRF), an algorithm adapting the canonical Machine Learning (ML) tool to flexibly model evolving parameters in a linear macro equation. Its main output, Generalized Time-Varying Parameters (GTVPs), is a versatile device nesting many popular nonlinearities (threshold/switching, smooth transition, structural breaks/change) and allowing for sophisticated new ones. The approach delivers clear forecasting gains over numerous alternatives, predicts the 2008 drastic rise in unemployment, and performs well for inflation. Unlike most ML-based methods, MRF is directly interpretable -- via its GTVPs. For instance, the successful unemployment forecast is due to the influence of forward-looking variables (e.g., term spreads, housing starts) nearly doubling before every recession. Interestingly, the Phillips curve has indeed flattened, and its might is highly cyclical.


Residual Likelihood Forests

arXiv.org Machine Learning

Ensemble and Boosting methods such as Random Forests [3] and AdaBoost [19] are often recognized as some of the best out-of-the-box classifiers, consistently achieving state-of-the-art performance across a wide range of computer vision tasks including applications in image classification [1], semantic segmentation [22], object recognition [12] and data clustering [16]. The success of these methods is attributed to their ability to learn models (strong learners) which possess low bias and variance through the combination of weakly correlated learners (weak learners). Forests reduce variance by averaging their weak learners over the ensemble. Boosting, on the other hand, looks towards reducing both bias and variance through sequentially optimizing under conditional constraints. The commonality between both approaches is in the way each learner is constructed: both methods use a top-down induction algorithm (such as CART [4]) which greedily learns decision nodes in a recursive manner. This approach is known to be suboptimal in terms of objective maximization as there are no guarantees that a global loss is being minimized [14]. In practice, this type of optimization requires the non-linearity offered by several (very) deep trees, which results in redundancy in learned models with large overlaps of information between weak learners. To address these limitations, the ensemble approaches of [11, 20] have utilized gradient information within a boosting framework. This allows weak learners to be fit to pseudo-residuals or to a set of adaptive weights and allows for the minimization of a global loss via gradient descent.
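
The idea of fitting weak learners to pseudo-residuals can be sketched in a few lines. The example below is a generic squared-loss gradient boosting loop with one-split regression stumps on 1-D data; it is an illustrative assumption, not the method proposed in the paper, and the data, function names, and hyperparameters are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    # Dropping the largest value keeps the right-hand side non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, rounds=50, learning_rate=0.1):
    """Squared-loss gradient boosting with stumps as weak learners."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    history = []  # training MSE measured before each round
    for _ in range(rounds):
        # For squared loss, the negative gradient is the plain residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        history.append(sum(r * r for r in residuals) / len(ys))
        stumps.append(fit_stump(xs, residuals))
    return predict, history


# Hypothetical toy data: a step-like target on a 1-D grid.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
model, mse_history = gradient_boost(xs, ys)
```

Each round performs one gradient step on the global training loss in function space, so the recorded training error never increases from one round to the next.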


Brain Predictability toolbox: a Python library for neuroimaging based machine learning

arXiv.org Machine Learning

Summary: Brain Predictability toolbox (BPt) represents a unified framework of machine learning (ML) tools designed to work with both tabulated data (in particular brain, psychiatric, behavioral, and physiological variables) and neuroimaging specific derived data (e.g., brain volumes and surfaces). This package is suitable for investigating a wide range of different neuroimaging based ML questions, in particular, those queried from large human datasets. Availability and Implementation: BPt has been developed as an open-source Python 3.6+ package hosted at https://github.com/sahahn/BPt under MIT License, with documentation provided at https://bpt.readthedocs.io/en/latest/, and continues to be actively developed. The project can be downloaded through the GitHub link provided. A web GUI based on the same code is currently under development and can be set up through Docker with instructions at https://github.com/sahahn/BPt_app. Contact: Please contact Sage Hahn at sahahn@uvm.edu.


Imbalanced-learn: Handling imbalanced class problem

#artificialintelligence

In the previous article, we went through the different methods for dealing with imbalanced data. In this article, let us try to understand how to use the imbalanced-learn library to deal with imbalanced class problems. We will make use of the PyCaret library and UCI's default of credit card clients dataset, which is also built into PyCaret. Imbalanced-learn is a Python package that provides a number of re-sampling techniques to deal with class imbalance problems commonly encountered in classification tasks. Note that imbalanced-learn is compatible with scikit-learn and is also part of the scikit-learn-contrib projects.
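
To make the idea of re-sampling concrete, here is a minimal standard-library sketch of random over-sampling, the simplest such technique: minority-class rows are duplicated at random until every class is as frequent as the largest one. This is not imbalanced-learn's implementation; the function name and the toy data are hypothetical. In imbalanced-learn itself, the corresponding tool is the `RandomOverSampler` class with its `fit_resample` method.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 6 non-default rows (0) and 2 default rows (1).
rows = [[20], [25], [30], [35], [40], [45], [50], [55]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
new_rows, new_labels = random_oversample(rows, labels)
balanced_counts = Counter(new_labels)
```

Re-sampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.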


Targeting for long-term outcomes

arXiv.org Machine Learning

Decision-makers often want to target interventions (e.g., marketing campaigns) so as to maximize an outcome that is observed only in the long term. This typically requires delaying decisions until the outcome is observed or relying on simple short-term proxies for the long-term outcome. Here we build on the statistical surrogacy and off-policy learning literature to impute the missing long-term outcomes and then approximate the optimal targeting policy on the imputed outcomes via a doubly-robust approach. We apply our approach in large-scale proactive churn management experiments at The Boston Globe by targeting optimal discounts to its digital subscribers to maximize their long-term revenue. We first show that conditions for validity of average treatment effect estimation with imputed outcomes are also sufficient for valid policy evaluation and optimization; furthermore, these conditions can be somewhat relaxed for policy optimization. We then validate this approach empirically by comparing it with a policy learned on the ground-truth long-term outcomes and show that they are statistically indistinguishable. Our approach also outperforms a policy learned on short-term proxies for the long-term outcome. In a second field experiment, we implement the optimal targeting policy with additional randomized exploration, which allows us to update the optimal policy for each new cohort of customers to account for potential non-stationarity. Over three years, our approach had a net-positive revenue impact in the range of $4-5 million compared to The Boston Globe's current policies.
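
As a rough illustration of doubly-robust policy evaluation in general, the sketch below estimates the mean outcome a targeting policy would achieve from logged experimental data. It is a textbook-style estimator under stated assumptions, not the authors' procedure: the logging propensities are assumed known (a uniform 50/50 experiment), the outcome model is deliberately crude, and all names and data are hypothetical.

```python
def doubly_robust_value(policy, data, outcome_model, propensity):
    """Doubly robust estimate of the mean outcome under `policy`.

    data: iterable of (x, a, y) with logged action a and (imputed) outcome y.
    outcome_model(x, a): predicted outcome for taking action a on unit x.
    propensity(x, a): probability that the logging policy chose a for x.
    """
    total = 0.0
    n = 0
    for x, a, y in data:
        target = policy(x)
        # Start from the model's prediction for the action the policy picks.
        estimate = outcome_model(x, target)
        # Correct with the observed outcome when the logged action matches.
        if a == target:
            estimate += (y - outcome_model(x, a)) / propensity(x, a)
        total += estimate
        n += 1
    return total / n


# Hypothetical logged data: unit type x, action a, outcome y.
data = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 1.0)]


def policy(x):
    # Candidate targeting rule: treat unit type 0, do not treat type 1.
    return 1 if x == 0 else 0


def crude_model(x, a):
    # A deliberately poor, constant outcome model.
    return 1.5


def uniform_propensity(x, a):
    # Assumed uniform randomization over two actions.
    return 0.5


value = doubly_robust_value(policy, data, crude_model, uniform_propensity)
```

In this toy example the estimate equals 2.5, the mean outcome the policy actually obtains on the logged units, even though the outcome model is wrong; the correct propensities compensate for it.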


Complete Guide To XGBoost With Implementation In R

#artificialintelligence

In recent times, ensemble techniques have become popular among data scientists and enthusiasts. Until recently, Random Forest and Gradient Boosting algorithms were winning data science competitions and hackathons; over the last few years, XGBoost has been performing better than other algorithms on problems involving structured data. Apart from its performance, XGBoost is also recognized for its speed, accuracy and scale. XGBoost is built on the Gradient Boosting framework. Just like other boosting algorithms, XGBoost uses decision trees for its ensemble model.


Pay as you go machine learning inference with AWS Lambda

#artificialintelligence

This post is courtesy of Eitan Sela, Senior Startup Solutions Architect. Many customers want to deploy machine learning models for real-time inference, and pay only for what they use. Using Amazon EC2 instances for real-time inference may not be cost-effective for supporting sporadic inference requests throughout the day. AWS Lambda is a serverless compute service with pay-per-use billing. However, ML frameworks like XGBoost are too large to fit into the 250 MB application artifact size limit, or the 512 MB /tmp space limit.


Versatile Verification of Tree Ensembles

arXiv.org Artificial Intelligence

Machine learned models often must abide by certain requirements (e.g., fairness or legal). This has spurred interest in developing approaches that can provably verify whether a model satisfies certain properties. This paper introduces a generic algorithm called Veritas that enables tackling multiple different verification tasks for tree ensemble models like random forests (RFs) and gradient boosting decision trees (GBDTs). This generality contrasts with previous work, which has focused exclusively on either adversarial example generation or robustness checking. Veritas formulates the verification task as a generic optimization problem and introduces a novel search space representation. Veritas offers two key advantages. First, it provides anytime lower and upper bounds when the optimization problem cannot be solved exactly. In contrast, many existing methods have focused on exact solutions and are thus limited by the verification problem being NP-complete. Second, Veritas produces full (bounded suboptimal) solutions that can be used to generate concrete examples. We experimentally show that Veritas outperforms the previous state of the art by (a) generating exact solutions more frequently, (b) producing tighter bounds when (a) is not possible, and (c) offering orders of magnitude speed ups. As a result, Veritas enables tackling more and larger real-world verification scenarios.
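
To give a sense of what lower and upper bounds on a tree ensemble's output look like, the sketch below computes a naive bound over a box of inputs by summing, per tree, the smallest and largest leaf value reachable inside the box. This is a simple relaxation written for illustration only and is not the Veritas algorithm; the tree encoding, function names, and toy ensemble are hypothetical.

```python
def leaf_range(node, box):
    """Min and max leaf value reachable in one tree for inputs inside `box`.

    node: a float leaf value, or a tuple (feature, threshold, left, right)
          where the left branch is taken when x[feature] < threshold.
    box:  dict mapping feature -> (low, high), inclusive bounds.
    """
    if not isinstance(node, tuple):
        return node, node
    feature, threshold, left, right = node
    low, high = box[feature]
    ranges = []
    if low < threshold:    # some point in the box goes left
        ranges.append(leaf_range(left, box))
    if high >= threshold:  # some point in the box goes right
        ranges.append(leaf_range(right, box))
    return min(r[0] for r in ranges), max(r[1] for r in ranges)


def ensemble_bounds(trees, box):
    """Sound (possibly loose) bounds on the sum of tree outputs over `box`."""
    per_tree = [leaf_range(tree, box) for tree in trees]
    return sum(r[0] for r in per_tree), sum(r[1] for r in per_tree)


# Hypothetical two-tree ensemble over features 0 and 1.
trees = [
    (0, 5.0, 1.0, 2.0),
    (1, 3.0, (0, 2.0, -1.0, 0.5), 4.0),
]
box = {0: (0.0, 4.0), 1: (0.0, 10.0)}
lower, upper = ensemble_bounds(trees, box)
```

The bounds are valid because every input in the box reaches one of the counted leaves in each tree, but they can be loose: trees are treated independently and the box is not narrowed along a path, so leaf combinations that no single input can reach may be included.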