 Regression


Loss Rate Forecasting Framework Based on Macroeconomic Changes: Application to US Credit Card Industry

arXiv.org Machine Learning

A major part of the balance sheets of the largest US banks consists of credit card portfolios. Hence, managing the charge-off rates is a vital task for the profitability of the credit card industry. Different macroeconomic conditions affect individuals' behavior in paying down their debts. In this paper, we propose an expert system for loss forecasting in the credit card industry using macroeconomic indicators. We select the indicators based on a thorough review of the literature and experts' opinions covering all aspects of the economy, consumer, business, and government sectors. State-of-the-art machine learning models are used to develop the proposed expert system framework. We develop two versions of the forecasting expert system, which utilize different approaches to select between the lags added to each indicator. Among 19 macroeconomic indicators that were used as the input, six were used in the model with optimal lags, and seven indicators were selected by the model using all lags. The features that were selected by each of these models covered all three sectors of the economy. Using the charge-off data for the top 100 US banks ranked by assets from the first quarter of 1985 to the second quarter of 2019, we achieve mean squared error values of 1.15E-03 and 1.04E-03 using the model with optimal lags and the model with all lags, respectively. The proposed expert system gives a holistic view of the economy to the practitioners in the credit card industry and helps them to see the impact of different macroeconomic conditions on their future loss.
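
To make the idea of "lags added to each indicator" concrete, here is a minimal sketch in Python of how lagged copies of one indicator could be arranged as model inputs, together with the mean squared error metric quoted above. It is not the paper's pipeline: the function names and the toy series are hypothetical, and the abstract does not say how the lags were actually constructed or selected.

```python
def make_lagged_features(series, lags):
    """Build one feature row per time step from lagged values of a series.

    Row i corresponds to time t = max(lags) + i and holds series[t - lag]
    for each lag. The first max(lags) observations are dropped because
    their lagged values are undefined.
    """
    max_lag = max(lags)
    return [[series[t - lag] for lag in lags]
            for t in range(max_lag, len(series))]


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Toy quarterly indicator (hypothetical values, not real data).
indicator = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = make_lagged_features(indicator, lags=[1, 2])
```

With lags of one and two quarters, each row pairs the previous quarter's value with the value from two quarters back, so a model fitted on `features` only sees information available before the quarter it predicts.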


Top 12 R Packages For Machine Learning In 2020

#artificialintelligence

R is one of the most widely used programming languages for statistical analysis and computing. Researchers in data science and statistical computing have used it for years because of its many intuitive features, which include running code without a compiler, an open-source codebase, and robust visualisation libraries. This article lists the top 12 R packages for machine learning one should know in 2020. About: The Classification And REgression Training (caret) package is a set of functions that seeks to streamline the process of creating predictive models.


Step-By-Step Guide On How To Build Linear Regression In R (With Code)

#artificialintelligence

It will also provide information about missing values or outliers, if any. For more information and functions you can use, read the beginner's guide to exploratory data analysis. Both missing values and outliers are a concern for machine learning models, as they tend to push the result towards extreme values.


General-Purpose Differentially-Private Confidence Intervals

arXiv.org Machine Learning

One of the most common statistical goals is to estimate a population parameter and quantify uncertainty by constructing a confidence interval. However, the field of differential privacy lacks easy-to-use and general methods for doing so. We partially fill this gap by developing two broadly applicable methods for private confidence-interval construction. The first is based on asymptotics: for two widely used model classes, exponential families and linear regression, a simple private estimator has the same asymptotic normal distribution as the corresponding non-private estimator, so confidence intervals can be constructed using quantiles of the normal distribution. These are computationally cheap and accurate for large data sets, but do not have good coverage for small data sets. The second approach is based on the parametric bootstrap. It applies "out of the box" to a wide class of private estimators and has good coverage at small sample sizes, but with increased computational cost. Both methods are based on post-processing the private estimator and do not consume additional privacy budget.
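
As an illustration of the second approach, the following Python sketch applies a parametric bootstrap to a differentially private mean. It is a simplified example under stated assumptions, not the authors' algorithm: the data are assumed to be Bernoulli with values in [0, 1], the private estimator is the sample mean plus Laplace noise, and all function names are hypothetical.

```python
import random


def laplace_noise(rng, scale):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(data, epsilon, rng):
    # Assumption: every record lies in [0, 1], so changing one record
    # moves the mean by at most 1/n (the sensitivity).
    n = len(data)
    return sum(data) / n + laplace_noise(rng, 1.0 / (n * epsilon))


def parametric_bootstrap_ci(p_hat, n, epsilon, rng, n_boot=500, alpha=0.05):
    # Post-processing only: the original data are never touched here.
    p_model = min(1.0, max(0.0, p_hat))  # clamp to a valid Bernoulli parameter
    replicates = []
    for _ in range(n_boot):
        # Simulate a data set from the fitted model, then rerun the
        # private estimator on it, including fresh privacy noise.
        synthetic = [1.0 if rng.random() < p_model else 0.0 for _ in range(n)]
        replicates.append(private_mean(synthetic, epsilon, rng))
    replicates.sort()
    lower = replicates[int((alpha / 2) * n_boot)]
    upper = replicates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


rng = random.Random(0)
data = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(200)]
p_hat = private_mean(data, epsilon=1.0, rng=rng)
lower, upper = parametric_bootstrap_ci(p_hat, n=len(data), epsilon=1.0, rng=rng)
```

Because the bootstrap only uses the released estimate `p_hat`, the sample size, and the known noise scale, it spends no privacy budget beyond the single call to `private_mean` on the real data. The interval here is a plain percentile interval, chosen for brevity.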


Horseshoe Prior Bayesian Quantile Regression

arXiv.org Machine Learning

This paper extends the horseshoe prior of Carvalho et al. (2010) to the Bayesian quantile regression (HS-BQR) and provides a fast sampling algorithm that speeds up computation significantly in high dimensions. The performance of the HS-BQR is tested on large scale Monte Carlo simulations and an empirical application relevant to macroeconomics. The Monte Carlo design considers several sparsity structures (sparse, dense, block) and error structures (i.i.d. errors and heteroskedastic errors). A number of LASSO based estimators (frequentist and Bayesian) are pitted against the HS-BQR to better gauge the performance of the method on the different designs. The HS-BQR yields just as good, or better performance than the other estimators considered when evaluated using coefficient bias and forecast error. We find that the HS-BQR is particularly potent in sparse designs and when estimating extreme quantiles. The simulations also highlight how the high dimensional quantile estimators fail to correctly identify the quantile function of the variables when both location and scale effects are present. In the empirical application, in which we evaluate forecast densities of US inflation, the HS-BQR provides well calibrated forecast densities whose individual quantiles have the highest pseudo R squared, highlighting its potential for Value-at-Risk estimation.
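
For readers unfamiliar with quantile regression, the Python sketch below shows the check (pinball) loss that quantile estimators minimize, and confirms by brute force that the best constant prediction under this loss is the corresponding sample quantile. It illustrates the general concept only. The horseshoe prior and the sampling algorithm from the paper are not implemented, and the function names are hypothetical.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean check loss at quantile level tau (0 < tau < 1).

    Under-predictions are weighted by tau and over-predictions by
    (1 - tau), which is what pulls the minimizer towards the tau-quantile.
    """
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def best_constant_quantile(y, tau):
    # Brute-force search over the observed values for the constant
    # prediction with the smallest pinball loss.
    candidates = sorted(set(y))
    return min(candidates, key=lambda c: pinball_loss(y, [c] * len(y), tau))


y = list(range(11))  # toy data: 0, 1, ..., 10
median = best_constant_quantile(y, tau=0.5)
upper_tail = best_constant_quantile(y, tau=0.9)
```

On this toy data the search returns 5 for the median and 9 for the 0.9 quantile, which matches the intuition that higher quantile levels penalize under-prediction more heavily.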


Generalizing Gain Penalization for Feature Selection in Tree-based Models

arXiv.org Machine Learning

We develop a new approach for feature selection via gain penalization in tree-based models. First, we show that previous methods do not perform sufficient regularization and often exhibit sub-optimal out-of-sample performance, especially when correlated features are present. Instead, we develop a new gain penalization idea that exhibits a general local-global regularization for tree-based models. The new method allows for more flexibility in the choice of feature-specific importance weights. We validate our method on both simulated and real data and implement it as an extension of the popular R package ranger.


Real-Time Optimization Of Web Publisher RTB Revenues

arXiv.org Machine Learning

This paper describes an engine to optimize web publisher revenues from second-price auctions. These auctions are widely used to sell online ad spaces in a mechanism called real-time bidding (RTB). Optimization within these auctions is crucial for web publishers, because setting appropriate reserve prices can significantly increase revenue. We consider a practical real-world setting where the only available information before an auction occurs consists of a user identifier and an ad placement identifier. The real-world challenges we had to tackle consist mainly of tracking the dependencies on both the user and placement in a highly non-stationary environment and of dealing with censored bid observations. These challenges led us to make the following design choices: (i) we adopted a relatively simple non-parametric regression model of auction revenue based on an incremental time-weighted matrix factorization which implicitly builds adaptive users' and placements' profiles; (ii) we jointly used a non-parametric model to estimate the first and second bids' distribution when they are censored, based on an on-line extension of Aalen's additive model. Our engine is a component of a deployed system handling hundreds of web publishers across the world, serving billions of ads a day to hundreds of millions of visitors. The engine is able to predict, for each auction, an optimal reserve price in approximately one millisecond and yields a significant revenue increase for the web publishers.
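
To show why the reserve price matters, here is a small Python sketch of publisher revenue in a single second-price auction with a reserve. It encodes only the standard auction rule and is not the engine described in the paper. The function name and the example bids are hypothetical.

```python
def second_price_revenue(bids, reserve):
    """Publisher revenue from one second-price auction with a reserve price.

    The highest bidder wins only if the top bid meets the reserve, and
    then pays the larger of the second-highest bid and the reserve.
    """
    ranked = sorted(bids, reverse=True)
    if not ranked or ranked[0] < reserve:
        return 0.0  # no bid clears the reserve, so the ad slot goes unsold
    second = ranked[1] if len(ranked) > 1 else 0.0
    return max(second, reserve)


bids = [5.0, 2.0]  # hypothetical first and second bids
revenue_low_reserve = second_price_revenue(bids, reserve=1.0)
revenue_good_reserve = second_price_revenue(bids, reserve=3.0)
revenue_high_reserve = second_price_revenue(bids, reserve=6.0)
```

A reserve set between the two bids lifts revenue from the second bid up to the reserve, while a reserve above the top bid loses the sale entirely. This trade-off is why the bid distributions need to be estimated before a reserve can be chosen.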


Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales

arXiv.org Machine Learning

This study addresses the problem of off-policy evaluation (OPE) from dependent samples obtained via the bandit algorithm. The goal of OPE is to evaluate a new policy using historical data obtained from behavior policies generated by the bandit algorithm. Because the bandit algorithm updates the policy based on past observations, the samples are not independent and identically distributed (i.i.d.). However, several existing methods for OPE do not take this issue into account and are based on the assumption that samples are i.i.d. In this study, we address this problem by constructing an estimator from a standardized martingale difference sequence. To standardize the sequence, we consider using evaluation data or sample splitting with a two-step estimation. This technique produces an estimator with asymptotic normality without restricting a class of behavior policies. In an experiment, the proposed estimator performs better than existing methods, which assume that the behavior policy converges to a time-invariant policy.


Targeted Learning: Robust Statistics for Reproducible Research

arXiv.org Machine Learning

Targeted Learning is a subfield of statistics that unifies advances in causal inference, machine learning and statistical theory to help answer scientifically impactful questions with statistical confidence. Targeted Learning is driven by complex problems in data science and has been implemented in a diversity of real-world scenarios: observational studies with missing treatments and outcomes, personalized interventions, longitudinal settings with time-varying treatment regimes, survival analysis, adaptive randomized trials, mediation analysis, and networks of connected subjects. In contrast to the (mis)application of restrictive modeling strategies that dominate the current practice of statistics, Targeted Learning establishes a principled standard for statistical estimation and inference (i.e., confidence intervals and p-values). This multiply robust approach is accompanied by a guiding roadmap and a burgeoning software ecosystem, both of which provide guidance on the construction of estimators optimized to best answer the motivating question. The roadmap of Targeted Learning emphasizes tailoring statistical procedures so as to minimize their assumptions, carefully grounding them only in the scientific knowledge available. The end result is a framework that honestly reflects the uncertainty in both the background knowledge and the available data in order to draw reliable conclusions from statistical analyses -- ultimately enhancing the reproducibility and rigor of scientific findings.


Symbolic Regression using Mixed-Integer Nonlinear Optimization

arXiv.org Machine Learning

The Symbolic Regression (SR) problem, where the goal is to find a regression function that does not have a pre-specified form but is any function that can be composed of a list of operators, is a hard problem in machine learning, both theoretically and computationally. Genetic programming based methods, which heuristically search over a very large space of functions, are the most commonly used methods to tackle SR problems. An alternative mathematical programming approach, proposed in the last decade, is to express the optimal symbolic expression as the solution of a system of nonlinear equations over continuous and discrete variables that minimizes a certain objective, and to solve this system via a global solver for mixed-integer nonlinear programming problems. Algorithms based on the latter approach are often very slow. We propose a hybrid algorithm that combines mixed-integer nonlinear optimization with explicit enumeration and incorporates constraints from dimensional analysis. We show that our algorithm is competitive, for some synthetic data sets, with state-of-the-art SR software and a recent physics-inspired method called AI Feynman.