Regression
Multilevel and Mixed Models Fall 2016
Multilevel models are a class of regression models for data that have a hierarchical (or nested) structure. Common examples of such data structures are students nested within schools or classrooms, patients nested within hospitals, or survey respondents nested within countries. Using regression techniques that ignore this hierarchical structure (such as ordinary least squares) can lead to incorrect results because such methods assume that all observations are independent. Perhaps more important, using inappropriate techniques (like pooling or aggregating) prevents researchers from asking substantively interesting questions about how processes work at different levels. This two-day seminar provides an intensive introduction to multilevel models.
Stealing Machine Learning Models via Prediction APIs
Tramèr, Florian, Zhang, Fan, Juels, Ari, Reiter, Michael K., Ristenpart, Thomas
Machine learning (ML) models may be deemed confidential due to their sensitive training data, commercial value, or use in security applications. Increasingly often, confidential ML models are being deployed with publicly accessible query interfaces. ML-as-a-service ("predictive analytics") systems are an example: Some allow users to train models on potentially sensitive data and charge others for access on a pay-per-query basis. The tension between model confidentiality and public access motivates our investigation of model extraction attacks. In such attacks, an adversary with black-box access, but no prior knowledge of an ML model's parameters or training data, aims to duplicate the functionality of (i.e., "steal") the model. Unlike in classical learning theory settings, ML-as-a-service offerings may accept partial feature vectors as inputs and include confidence values with predictions. Given these practices, we show simple, efficient attacks that extract target ML models with near-perfect fidelity for popular model classes including logistic regression, neural networks, and decision trees. We demonstrate these attacks against the online services of BigML and Amazon Machine Learning. We further show that the natural countermeasure of omitting confidence values from model outputs still admits potentially harmful model extraction attacks. Our results highlight the need for careful ML model deployment and new model extraction countermeasures.
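To see why returning confidence values makes extraction easy for logistic regression, note that the logit of the returned probability is linear in the input, so d + 1 queries give a solvable linear system in the d weights and the bias. The sketch below illustrates that idea only. The `make_target` function is a hypothetical stand-in for a prediction API (it is not the BigML or Amazon interface), and the secret weights are made up for the demonstration.

```python
import numpy as np

def make_target(weights, bias):
    """Hypothetical black-box API: returns the confidence P(y=1 | x)."""
    def query(x):
        return 1.0 / (1.0 + np.exp(-(x @ weights + bias)))
    return query

def extract_logistic_model(query, n_features, rng):
    """Recover (weights, bias) from n_features + 1 confidence queries."""
    # Random query points: d + 1 of them almost surely give a full-rank system.
    X = rng.normal(size=(n_features + 1, n_features))
    p = np.array([query(x) for x in X])
    # The logit of the confidence is linear in x: log(p / (1 - p)) = w.x + b
    logits = np.log(p / (1.0 - p))
    # Append a column of ones so the bias is solved together with the weights.
    A = np.hstack([X, np.ones((n_features + 1, 1))])
    solution = np.linalg.solve(A, logits)
    return solution[:-1], solution[-1]

# Made-up "confidential" parameters, used only to build the demo target.
rng = np.random.default_rng(0)
secret_w = np.array([1.5, -2.0, 0.5])
secret_b = 0.25
api = make_target(secret_w, secret_b)

stolen_w, stolen_b = extract_logistic_model(api, n_features=3, rng=rng)
print(stolen_w, stolen_b)
```

If the API rounded its confidence values or returned only class labels, this exact system would no longer hold, which is why the abstract treats omitting confidences as a separate, weaker setting.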
Estimating Delivery Times: A Case Study In Practical Machine Learning
Machine Learning is rapidly becoming a required and critical component of engineering organizations across the tech industry. From movie recommendation algorithms to self-driving cars, it is clearly an exciting and compelling field. Companies are hiring armies of Machine Learning researchers to solve difficult problems like voice and object recognition. What does this all mean to the average software engineer? In many cases, extremely specialized knowledge is necessary to outperform existing state-of-the-art systems.
Two-stage Sampling, Prediction and Adaptive Regression via Correlation Screening (SPARCS)
Firouzi, Hamed, Hero, Alfred, Rajaratnam, Bala
This paper proposes a general adaptive procedure for budget-limited predictor design in high dimensions called two-stage Sampling, Prediction and Adaptive Regression via Correlation Screening (SPARCS). SPARCS can be applied to high dimensional prediction problems in experimental science, medicine, finance, and engineering, as illustrated by the following. Suppose one wishes to run a sequence of experiments to learn a sparse multivariate predictor of a dependent variable $Y$ (disease prognosis for instance) based on a $p$ dimensional set of independent variables $\mathbf X=[X_1,\ldots, X_p]^T$ (assayed biomarkers). Assume that the cost of acquiring the full set of variables $\mathbf X$ increases linearly in its dimension. SPARCS breaks the data collection into two stages in order to achieve an optimal tradeoff between sampling cost and predictor performance. In the first stage we collect a few ($n$) expensive samples $\{y_i,\mathbf x_i\}_{i=1}^n$, at the full dimension $p\gg n$ of $\mathbf X$, winnowing the number of variables down to a smaller dimension $l < p$ using a type of cross-correlation or regression coefficient screening. In the second stage we collect a larger number $(t-n)$ of cheaper samples of the $l$ variables that passed the screening of the first stage. At the second stage, a low dimensional predictor is constructed by solving the standard regression problem using all $t$ samples of the selected variables. SPARCS is an adaptive online algorithm that implements false positive control on the selected variables, is well suited to small sample sizes, and is scalable to high dimensions. We establish asymptotic bounds for the Familywise Error Rate (FWER), specify high dimensional convergence rates for support recovery, and establish optimal sample allocation rules to the first and second stages.
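The two-stage flow described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: it keeps the top $l$ variables by absolute sample correlation and does not implement the paper's false positive control or sample allocation rules. The samplers, dimensions, and coefficients are all hypothetical synthetic choices.

```python
import numpy as np

def make_samplers(p, support, coefs, noise, rng):
    """Hypothetical data source: y depends linearly on a few of the p variables."""
    def draw(m):
        X = rng.normal(size=(m, p))
        y = X[:, support] @ coefs + noise * rng.normal(size=m)
        return X, y

    def sample_full(m):
        # Expensive sample: all p variables are measured.
        return draw(m)

    def sample_selected(m, idx):
        # Cheap sample: only the screened variables are measured.
        X, y = draw(m)
        return X[:, idx], y

    return sample_full, sample_selected

def two_stage_sketch(sample_full, sample_selected, n, t, l):
    """Screen on n full samples, then fit least squares on t samples of l variables."""
    # Stage 1: n samples at full dimension p, screened by absolute correlation with y.
    X1, y1 = sample_full(n)
    Xc = X1 - X1.mean(axis=0)
    yc = y1 - y1.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    selected = np.sort(np.argsort(-np.abs(corr))[:l])
    # Stage 2: t - n further samples of the selected variables only.
    X2, y2 = sample_selected(t - n, selected)
    # Ordinary least squares (with intercept) on all t samples of the selected variables.
    X_all = np.vstack([X1[:, selected], X2])
    y_all = np.concatenate([y1, y2])
    A = np.hstack([X_all, np.ones((t, 1))])
    coef, *_ = np.linalg.lstsq(A, y_all, rcond=None)
    return selected, coef[:-1], coef[-1]

rng = np.random.default_rng(0)
true_support = [3, 50, 90]
true_coefs = np.array([3.0, -3.0, 3.0])
sample_full, sample_selected = make_samplers(
    p=100, support=true_support, coefs=true_coefs, noise=0.5, rng=rng
)
selected, weights, intercept = two_stage_sketch(
    sample_full, sample_selected, n=80, t=400, l=10
)
print(selected, np.round(weights, 2))
```

In this toy setup the first stage measures all 100 variables only 80 times, and the remaining 320 samples measure just the 10 screened variables, which is the cost saving the abstract describes.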
Tuning Parameter Calibration in High-dimensional Logistic Regression With Theoretical Guarantees
Feature selection is a standard approach to understanding and modeling high-dimensional classification data, but the corresponding statistical methods hinge on tuning parameters that are difficult to calibrate. In particular, existing calibration schemes in the logistic regression framework lack any finite sample guarantees. In this paper, we introduce a novel calibration scheme for penalized logistic regression. It is based on simple tests along the tuning parameter path and satisfies optimal finite sample bounds. It is also amenable to easy and efficient implementations, and it rivals or outmatches existing methods in simulations and real data applications.
Logistic Regression versus Decision Trees
The question of which model type to apply to a Machine Learning task can be a daunting one given the immense number of algorithms available in the literature. It can be difficult to compare the relative merits of two methods, as one can outperform the other in a certain class of problems while consistently coming in behind for another class. In this post, the last one of our series of posts about Logistic Regression, we'll explore the differences between Decision Trees and Logistic Regression for classification problems, and try to highlight scenarios where one might be recommended over the other. Logistic Regression and trees differ in the way that they generate decision boundaries, i.e., the lines that are drawn to separate different classes. To illustrate this difference, consider the results of the two model types on a 2-class problem: Decision Trees bisect the space into smaller and smaller regions, whereas Logistic Regression fits a single line to divide the space exactly into two.
Python Machine Learning Mini-Course - Machine Learning Mastery
Python is one of the fastest-growing platforms for applied machine learning. In this mini-course, you will discover how you can get started, build accurate models and confidently complete predictive modeling machine learning projects using Python in 14 days. This is a big and important post. You might want to bookmark it.
Spark Machine Learning Pipeline by Example - Hortonworks
With the release of Spark 2.0, Spark's machine learning library changed from mllib to ml. One of the biggest changes in the new ml library is the introduction of the so-called machine learning pipeline. It provides a high-level abstraction of the machine learning flow and greatly simplifies the creation of a machine learning process. In this tutorial, we will walk through the steps to create a machine learning pipeline in Spark that predicts airline flight delays, and explain what happens under the hood in the pipeline.
Online Categorical Subspace Learning for Sketching Big Data with Misses
Shen, Yanning, Mardani, Morteza, Giannakis, Georgios B.
With the scale of data growing every day, reducing the dimensionality (a.k.a. sketching) of high-dimensional data has emerged as a task of paramount importance. Relevant issues to address in this context include the sheer volume of data that may consist of categorical samples, the typically streaming format of acquisition, and the possibly missing entries. To cope with these challenges, the present paper develops a novel categorical subspace learning approach to unravel the latent structure for three prominent categorical (bilinear) models, namely, Probit, Tobit, and Logit. The deterministic Probit and Tobit models treat data as quantized values of an analog-valued process lying in a low-dimensional subspace, while the probabilistic Logit model relies on low dimensionality of the data log-likelihood ratios. Leveraging the low intrinsic dimensionality of the sought models, a rank regularized maximum-likelihood estimator is devised, which is then solved recursively via alternating majorization-minimization to sketch high-dimensional categorical data `on the fly.' The resultant procedure alternates between sketching the new incomplete datum and refining the latent subspace, leading to lightweight first-order algorithms with highly parallelizable tasks per iteration. As an extra degree of freedom, the quantization thresholds are also learned jointly along with the subspace to enhance the predictive power of the sought models. Performance of the subspace iterates is analyzed for both infinite and finite data streams, where for the former asymptotic convergence to the stationary point set of the batch estimator is established, while for the latter sublinear regret bounds are derived for the empirical cost. Simulated tests with both synthetic and real-world datasets corroborate the merits of the novel schemes for real-time movie recommendation and chess-game classification.