Regression
A Continuum of Optimal Primal-Dual Algorithms for Convex Composite Minimization Problems with Applications to Structured Sparsity
Ko, Seyoon, Yu, Donghyeon, Won, Joong-Ho
Many statistical learning problems can be posed as minimization of a sum of two convex functions, one typically a composition of non-smooth and linear functions. Examples include regression under structured sparsity assumptions. Popular algorithms for solving such problems, e.g., ADMM, often involve non-trivial optimization subproblems or smoothing approximation. We consider two classes of primal-dual algorithms that do not incur these difficulties, and unify them from a perspective of monotone operator theory. From this unification we propose a continuum of preconditioned forward-backward operator splitting algorithms amenable to parallel and distributed computing. For the entire region of convergence of the whole continuum of algorithms, we establish its rates of convergence. For some known instances of this continuum, our analysis closes the gap in theory. We further exploit the unification to propose a continuum of accelerated algorithms. We show that the whole continuum attains the theoretically optimal rate of convergence. The scalability of the proposed algorithms, as well as their convergence behavior, is demonstrated up to 1.2 million variables with a distributed implementation.
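To make the term "forward-backward splitting" concrete, here is a minimal sketch of the simplest instance: a gradient (forward) step on a smooth least-squares term followed by a proximal (backward) step on an l1 penalty. It is not the preconditioned primal-dual continuum proposed in the paper. The function names, the toy matrix, and the step size are assumptions chosen for illustration.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v| (the backward step for an l1 penalty)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def forward_backward_lasso(A, b, lam, step, iters=100):
    """Minimise 0.5 * ||A x - b||^2 + lam * ||x||_1 by forward-backward splitting.

    `step` should be at most 1 / L, where L is the largest eigenvalue of A^T A.
    """
    n = len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Forward step: gradient of the smooth term, A^T (A x - b).
        residual = [sum(a * xj for a, xj in zip(row, x)) - bi
                    for row, bi in zip(A, b)]
        grad = [sum(A[i][j] * residual[i] for i in range(len(A)))
                for j in range(n)]
        # Backward step: proximal map of the non-smooth term.
        x = [soft_threshold(xj - step * gj, step * lam)
             for xj, gj in zip(x, grad)]
    return x


# Hypothetical toy problem with a diagonal A, so the answer can be checked by hand.
A = [[2.0, 0.0], [0.0, 1.0]]
b = [4.0, 0.5]
x_hat = forward_backward_lasso(A, b, lam=1.0, step=0.25)
```

On this toy problem the second coefficient is shrunk exactly to zero, which is the sparsity effect the non-smooth term is there to produce.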
Linear Regression in Python; Predict The Bay Area's Home Prices
I chose the Bay Area housing price dataset, which was sourced from the Bay Area Home Sales Database and Zillow. It covers homes sold between January 2013 and December 2015 and has many characteristics that make it useful for learning. The dataset can be downloaded from here. There are several features that we do not need, such as "info", "z_address", "zipcode" (we have "neighborhood" as a location variable), "zipid" and "zestimate" (this is the price estimated by Zillow, and we don't want our model to be affected by it).
When to Categorize Continuous Predictor in a Regression Model?
Many research fields routinely categorize continuous predictor variables, and these tend to be the same fields that rely mostly on ANOVA. The categorization is often done through a median split: values above the median are labeled high and values below it are labeled low. The way out of this dilemma is to be able to decide whether to treat an independent variable as categorical or continuous. Knowing when each choice is appropriate, and understanding how it affects the interpretation of the parameters, lets data analysts find real results they might otherwise miss. The general linear model itself is not concerned with whether the predictor you used is continuous or categorical. But you, as a data analyst, should choose the coding of the predictor based on the information you need from the analysis.
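As a rough illustration of what a median split does, the sketch below fits the same simple linear model twice: once with the predictor left continuous and once with it recoded as 0/1 around the median. The data are made up for the example, and `r_squared` is a hypothetical helper, not a function from any particular package.

```python
from statistics import mean, median


def r_squared(x, y):
    """R^2 of the least-squares line y = a + b * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


# Made-up data: y is linear in x plus a deterministic wiggle.
x = list(range(1, 21))
y = [2.0 * xi + (3.0 if xi % 2 else -3.0) for xi in x]

# Median split: 1 for values above the median, 0 otherwise.
cut = median(x)
x_split = [1 if xi > cut else 0 for xi in x]

r2_continuous = r_squared(x, y)  # slope = change in y per unit of x
r2_split = r_squared(x_split, y)  # slope = difference between group means
```

The model is the same in both fits, but the meaning of the slope changes with the coding, and in this example the dichotomized predictor explains less of the variance than the continuous one.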
Big Data Classification Using Augmented Decision Trees
Sambasivan, Rajiv, Das, Sourish
We present an algorithm for classification tasks on big data. Experiments conducted as part of this study indicate that the algorithm can be as accurate as ensemble methods such as random forests or gradient boosted trees. Unlike ensemble methods, the models produced by the algorithm can be easily interpreted. The algorithm is based on a divide and conquer strategy and consists of two steps. The first step consists of using a decision tree to segment the large dataset. By construction, decision trees attempt to create homogeneous class distributions in their leaf nodes. However, non-homogeneous leaf nodes are usually produced. The second step of the algorithm consists of using a suitable classifier to determine the class labels for the non-homogeneous leaf nodes. The decision tree segment provides a coarse segment profile while the leaf level classifier can provide information about the attributes that affect the label within a segment.
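The two-step idea can be sketched in a few lines, under strong simplifying assumptions: a single-split tree (a stump chosen by Gini impurity) stands in for the decision tree, and a nearest-centroid rule stands in for the "suitable classifier" used in non-homogeneous leaves. The abstract specifies neither choice, and all names and data below are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(rows, feature):
    """Step 1: pick the threshold that minimises size-weighted Gini impurity."""
    values = sorted({x[feature] for x, _ in rows})
    best_t, best_cost = None, None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in rows if x[feature] <= t]
        right = [y for x, y in rows if x[feature] > t]
        cost = len(left) * gini(left) + len(right) * gini(right)
        if best_cost is None or cost < best_cost:
            best_t, best_cost = t, cost
    return best_t


def fit_centroids(rows):
    """Mean feature vector per class, used inside an impure leaf."""
    groups = {}
    for x, y in rows:
        groups.setdefault(y, []).append(x)
    return {y: tuple(sum(col) / len(xs) for col in zip(*xs))
            for y, xs in groups.items()}


def fit_augmented(rows, feature=0):
    """Segment with the stump, then fit a classifier only where a leaf is mixed."""
    t = fit_stump(rows, feature)
    leaves = {}
    for side in (False, True):
        leaf = [(x, y) for x, y in rows if (x[feature] > t) == side]
        labels = {y for _, y in leaf}
        if len(labels) == 1:
            leaves[side] = ("constant", labels.pop())
        else:
            leaves[side] = ("centroid", fit_centroids(leaf))  # step 2
    return feature, t, leaves


def predict(model, x):
    feature, t, leaves = model
    kind, payload = leaves[x[feature] > t]
    if kind == "constant":
        return payload
    return min(payload, key=lambda y: sum((a - b) ** 2
                                          for a, b in zip(x, payload[y])))


# Hypothetical data: label is 0 when x0 is small; otherwise it depends on x1.
rows = [((0.1, 0.2), 0), ((0.2, 0.9), 0), ((0.3, 0.5), 0), ((0.4, 0.1), 0),
        ((0.6, 0.9), 1), ((0.7, 0.1), 0), ((0.8, 0.8), 1), ((0.9, 0.2), 0)]
model = fit_augmented(rows)
```

Here the stump gives the coarse segment profile (small versus large values of the first feature), and the centroids in the mixed leaf show which attribute separates the labels within that segment.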
MEBoost: Variable Selection in the Presence of Measurement Error
Brown, Benjamin, Weaver, Timothy, Wolfson, Julian
We present a novel method for variable selection in regression models when covariates are measured with error. The iterative algorithm we propose, MEBoost, follows a path defined by estimating equations that correct for covariate measurement error. Via simulation, we evaluate our method and compare its performance to the recently-proposed Convex Conditioned Lasso (CoCoLasso) and to the "naive" Lasso which does not correct for measurement error. Increasing the degree of measurement error increased prediction error and decreased the probability of accurate covariate selection, but this loss of accuracy was least pronounced when using MEBoost. We illustrate the use of MEBoost in practice by analyzing data from the Box Lunch Study, a clinical trial in nutrition where several variables are based on self-report and hence measured with error.
Intro to TensorFlow in R - Edgar's Data Lab
TensorFlow is a very powerful and flexible architecture. It provides the building blocks to create and fit basically any machine learning algorithm. But even a simple linear regression model has to be built "from scratch" using layers and estimators in TensorFlow. TensorFlow has a high-level API that provides "canned models" which, in my opinion, lowers the barrier to entry into experimenting with TensorFlow. And of course, R users are now able to access this API via the tfestimators package.
Calibration of Machine Learning Classifiers for Probability of Default Modelling
Fonseca, Pedro G., Lopes, Hugo D.
Binary classification is widely used in credit scoring to estimate the probability of default. The validation of such predictive models is based both on rank ability and on calibration (i.e. how accurately the probabilities output by the model map to the observed probabilities). In this study we cover the current best practices regarding calibration for binary classification, and explore how different approaches yield different results on real world credit scoring data. The limitations of evaluating credit scoring models using only rank ability metrics are explored. A benchmark is run on 18 real world datasets, and the results are compared. The calibration techniques used are Platt Scaling and Isotonic Regression. Also, different machine learning models are used: Logistic Regression, Random Forest Classifiers, and Gradient Boosting Classifiers. Results show that when the dataset is treated as a time series, re-calibration with Isotonic Regression improves long term calibration more than the alternative methods. Using re-calibration, the non-parametric models are able to outperform the Logistic Regression on Brier Score Loss.
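For readers unfamiliar with the two ingredients named above, here is a minimal sketch of the Brier score and of isotonic regression fitted with the pool-adjacent-violators procedure. It assumes distinct scores (ties are not handled) and evaluates on the same made-up data it was fitted on, which is optimistic; it does not reproduce the paper's benchmark.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def isotonic_calibrate(scores, outcomes):
    """Pool adjacent violators: non-decreasing fit of outcomes against scores.

    Returns calibrated probabilities in the original order of `scores`.
    Assumes all scores are distinct.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block is [sum of outcomes, number of points]
    for i in order:
        blocks.append([float(outcomes[i]), 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    calibrated = [0.0] * len(scores)
    pos = 0
    for total, count in blocks:
        for _ in range(count):
            calibrated[order[pos]] = total / count
            pos += 1
    return calibrated


# Made-up model scores and observed defaults (1 = default).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0]

calibrated = isotonic_calibrate(scores, outcomes)
brier_before = brier_score(scores, outcomes)
brier_after = brier_score(calibrated, outcomes)
```

Because the fitted map is monotone, the ranking of the scores is preserved up to ties while the probabilities themselves move toward the observed default rates.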
40 Interview Questions asked at Startups in Machine Learning / Data Science
These questions can make you think THRICE! Machine learning and data science are being looked at as the drivers of the next industrial revolution happening in the world today. This also means that there are numerous exciting startups looking for data scientists. What could be a better start for your aspiring career! Still, getting into these roles is not easy. You obviously need to get excited about the idea, the team and the vision of the company. You might also find some really difficult technical questions on your way. The set of questions asked depends on what the startup does. Do they build ML products? You should always find this out prior to beginning your interview preparation. To help you prepare for your next interview, I've prepared a list of 40 plausible and tricky questions which are likely to come your way in interviews. If you can understand and answer these questions, rest assured, you will put up a tough fight in your job interview. Note: A key to answering these questions is to have a concrete practical understanding of ML and related statistical concepts.
Tree Boosting With XGBoost – Why Does XGBoost Win "Every" Machine Learning Competition?
Tree boosting has empirically proven to be efficient for predictive mining for both classification and regression. For many years, MART (multiple additive regression trees) has been the tree boosting method of choice. But starting from 2015, a first-to-try, always-winning algorithm surged to the surface: XGBoost. This algorithm re-implements tree boosting and gained popularity by winning Kaggle and other data science competitions. The paper first introduces the supervised learning task and discusses model selection techniques.
Display advertising: Estimating conversion probability efficiently
Safari, Abdollah, Altman, Rachel MacKay, Loughin, Thomas M.
The goal of online display advertising is to entice users to "convert" (i.e., take a pre-defined action such as making a purchase) after clicking on the ad. An important measure of the value of an ad is the probability of conversion. The focus of this paper is the development of a computationally efficient, accurate, and precise estimator of conversion probability. The challenges associated with this estimation problem are the delays in observing conversions and the size of the data set (both number of observations and number of predictors). Two models have previously been considered as a basis for estimation: A logistic regression model and a joint model for observed conversion statuses and delay times. Fitting the former is simple, but ignoring the delays in conversion leads to an under-estimate of conversion probability. On the other hand, the latter is less biased but computationally expensive to fit. Our proposed estimator is a compromise between these two estimators. We apply our results to a data set from Criteo, a commerce marketing company that personalizes online display advertisements for users.