Regression
Large-Scale Classification using Multinomial Regression and ADMM
Fung, Samy Wu, Tyrväinen, Sanna, Ruthotto, Lars, Haber, Eldad
We present a novel method for learning the weights in multinomial logistic regression based on the alternating direction method of multipliers (ADMM). In each iteration, our algorithm decomposes the training into three steps: a linear least-squares problem for the weights, a global variable update involving a separable cross-entropy loss function, and a trivial dual variable update. The least-squares problem can be factorized in the off-line phase, and the separability in the global variable update allows for efficient parallelization, leading to faster convergence. We compare our method with stochastic gradient descent for linear classification as well as for transfer learning and show that the proposed ADMM-Softmax leads to improved generalization and convergence.
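The three steps named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `admm_softmax`, the Tikhonov weight `alpha`, the penalty `rho`, and the use of a few plain gradient steps for the global variable update are all choices made here for the example. The abstract does not say how that subproblem is solved.

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax with the usual max-shift for numerical stability.
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def admm_softmax(D, C, rho=1.0, alpha=1e-3, n_iter=50, n_inner=20):
    """Sketch of ADMM for: min_W CE(softmax(Z), C) + alpha/2 ||W||^2  s.t.  Z = D W.

    D: (n, d) features, C: (n, k) one-hot labels. Hypothetical parameters:
    rho (penalty), alpha (regularization), n_inner (gradient steps per Z-update).
    """
    n, d = D.shape
    k = C.shape[1]
    Z = np.zeros((n, k))  # global variable
    U = np.zeros((n, k))  # scaled dual variable
    # Off-line phase: factorize the least-squares system once.
    L = np.linalg.cholesky(D.T @ D + (alpha / rho) * np.eye(d))
    step = 1.0 / (rho + 1.0)  # safe step: the Z-objective is (rho + 1/2)-smooth
    for _ in range(n_iter):
        # Step 1: linear least-squares problem for the weights.
        rhs = D.T @ (Z - U)
        W = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        # Step 2: global variable update. The objective is separable across
        # rows (examples), so every row could be handled in parallel;
        # here the rows are simply updated together in vectorized form.
        V = D @ W + U
        for _ in range(n_inner):
            Z -= step * (softmax(Z) - C + rho * (Z - V))
        # Step 3: trivial dual variable update.
        U += D @ W - Z
    return W

# Synthetic demo: three Gaussian blobs in 2-D plus a bias column.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
labels = np.repeat(np.arange(3), 100)
X = centers[labels] + rng.normal(size=(300, 2))
D = np.hstack([X, np.ones((300, 1))])
C = np.eye(3)[labels]

W = admm_softmax(D, C)
acc = np.mean(np.argmax(D @ W, axis=1) == labels)
print("training accuracy:", acc)
```

Because the Cholesky factor `L` is computed once, each weight update costs only two triangular-style solves, which is the point of the off-line factorization.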
Information-Theoretic Understanding of Population Risk Improvement with Model Compression
Bu, Yuheng, Gao, Weihao, Zou, Shaofeng, Veeravalli, Venugopal V.
We show that model compression can improve the population risk of a pre-trained model, by studying the tradeoff between the decrease in the generalization error and the increase in the empirical risk with model compression. We first prove that model compression reduces an information-theoretic bound on the generalization error; this allows for an interpretation of model compression as a regularization technique to avoid overfitting. We then characterize the increase in empirical risk with model compression using rate distortion theory. These results imply that the population risk could be improved by model compression if the decrease in generalization error exceeds the increase in empirical risk. We show through a linear regression example that such a decrease in population risk due to model compression is indeed possible. Our theoretical results further suggest that the Hessian-weighted $K$-means clustering compression approach can be improved by regularizing the distance between the clustering centers. We provide experiments with neural networks to support our theoretical assertions.
40 Interview Questions asked at Startups in Machine Learning / Data Science
These questions can make you think THRICE! Machine learning and data science are being looked at as the drivers of the next industrial revolution happening in the world today. This also means that there are numerous exciting startups looking for data scientists. What could be a better start for your aspiring career? However, getting into these roles is not easy. You obviously need to get excited about the idea, the team and the vision of the company. You might also find some really difficult technical questions on your way. The set of questions asked depends on what the startup does. Do they build ML products? You should always find this out prior to beginning your interview preparation. To help you prepare for your next interview, I've prepared a list of 40 plausible & tricky questions which are likely to come your way in interviews. If you can answer and understand these questions, rest assured, you will put up a tough fight in your job interview. Note: A key to answering these questions is to have a concrete practical understanding of ML and related statistical concepts. You can get that know-how in our course 'Introduction to Data Science'!
what_nns_learn.html
Neural networks are famously difficult to interpret. It's hard to know what they are actually learning when we train them. Let's take a closer look and see whether we can build a good picture of what's going on inside. Just like every other supervised machine learning model, neural networks learn relationships between input variables and output variables. In fact, we can even see how they are related to the most iconic model of all, linear regression. Linear regression assumes a straight-line relationship between an input variable x and an output variable y. x is multiplied by a constant, m, which also happens to be the slope of the line, and it's added to another constant, b, which happens to be where the line crosses the y axis. We can represent this in a picture. Our input value x is multiplied by m. Our constant, b, is multiplied by one. And then they are added together to get y.
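The picture described above, x times m plus b times one, can be written as a tiny weighted sum. This is only an illustrative sketch; the function name `linear_neuron` and the example values are made up for the demonstration.

```python
def linear_neuron(x, m, b):
    """Linear regression drawn as a single node: a weighted sum of two inputs.

    The inputs are x and the constant 1; their weights are m (the slope)
    and b (where the line crosses the y axis).
    """
    inputs = [x, 1.0]
    weights = [m, b]
    return sum(w * i for w, i in zip(weights, inputs))

# With slope m = 2 and intercept b = 0.5, the input x = 3 gives y = 2*3 + 0.5*1.
y = linear_neuron(3.0, m=2.0, b=0.5)
print(y)  # 6.5
```

Treating b as the weight on a constant input of one is what lets the intercept be handled the same way as any other weight.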
Learning Models from Data with Measurement Error: Tackling Underreporting
Adams, Roy, Ji, Yuelong, Wang, Xiaobin, Saria, Suchi
Measurement error in observational datasets can lead to systematic bias in inferences based on these datasets. As studies based on observational data are increasingly used to inform decisions with real-world impact, it is critical that we develop a robust set of techniques for analyzing and adjusting for these biases. In this paper we present a method for estimating the distribution of an outcome given a binary exposure that is subject to underreporting. Our method is based on a missing data view of the measurement error problem, where the true exposure is treated as a latent variable that is marginalized out of a joint model. We prove three different conditions under which the outcome distribution can still be identified from data containing only error-prone observations of the exposure. We demonstrate this method on synthetic data and analyze its sensitivity to near violations of the identifiability conditions. Finally, we use this method to estimate the effects of maternal smoking and opioid use during pregnancy on childhood obesity, two important problems from public health. Using the proposed method, we estimate these effects using only subject-reported drug use data and substantially refine the range of estimates generated by a sensitivity analysis-based approach. Further, the estimates produced by our method are consistent with existing literature on both the effects of maternal smoking and the rate at which subjects underreport smoking.
Orthogonal Statistical Learning
Foster, Dylan J., Syrgkanis, Vasilis
We provide excess risk guarantees for statistical learning in the presence of an unknown nuisance component. We analyze a two-stage sample splitting meta-algorithm that takes as input two arbitrary estimation algorithms: one for the target model and one for the nuisance model. We show that if the population risk satisfies a condition called Neyman orthogonality, the impact of the first stage error on the excess risk bound achieved by the meta-algorithm is of second order. Our general theorem is agnostic to the particular algorithms used for the target and nuisance and only makes an assumption on their individual performance. This enables the use of a plethora of existing results from statistical learning and machine learning literature to give new guarantees for learning with a nuisance component. Moreover, by focusing on excess risk rather than parameter estimation, we can give guarantees under weaker assumptions than in previous works and accommodate the case where the target parameter belongs to a complex nonparametric class. When the nuisance and target parameters belong to arbitrary classes, we characterize conditions on the metric entropy such that oracle rates---rates of the same order as if we knew the nuisance model---are achieved. We also analyze the rates achieved by specific estimation algorithms such as variance-penalized empirical risk minimization, neural network estimation and sparse high-dimensional linear model estimation. We highlight the applicability of our results via four applications of primary importance: 1) heterogeneous treatment effect estimation, 2) offline policy optimization, 3) domain adaptation, and 4) learning with missing data.
A Zero-Shot Learning application in Deep Drawing process using Hyper-Process Model
One of the consequences of the shift from the mass production paradigm to mass customization in today's industrialized world is the need to increase the flexibility and responsiveness of manufacturing companies. High-mix / low-volume production forces constant accommodation of unknown product variants, which ultimately leads to long periods of machine calibration. The difficulty with machine calibration is that experience is required, together with a set of experiments, to meet the final product quality. Unfortunately, the number of possible combinations of machine parameters is so high that it is difficult to build empirical knowledge. As a result, trial-and-error approaches are normally taken, making one-of-a-kind products not viable. Therefore, a Zero-Shot Learning (ZSL) based approach called the hyper-process model (HPM), which learns the relation among multiple tasks, is used as a way to shorten the calibration phase. Assuming each product variant is a task to solve, first, a shape analysis is performed on the data to learn common modes of deformation between tasks, and second, a mapping between these modes and the task descriptions is learned. Ultimately, the present work has two main contributions: 1) the formulation of an industrial problem in a ZSL setting where new process models can be generated for process optimization and 2) the definition of a regression problem in the domain of ZSL. For that purpose, a 2-d deep drawing simulated process was used based on data collected from the Abaqus simulator, where a significant number of process models were collected to test the effectiveness of the approach. The obtained results show that it is possible to learn new tasks without any available data (both labeled and unlabeled) by leveraging information about already existing tasks, which speeds up the calibration phase and allows quicker integration of new products into manufacturing systems.
A Review on Quantile Regression for Stochastic Computer Experiments
Torossian, Léonard, Picheny, Victor, Faivre, Robert, Garivier, Aurélien
We report on an empirical study of the main strategies for conditional quantile estimation in the context of stochastic computer experiments. To ensure adequate diversity, six metamodels are presented, divided into three categories based on order statistics, functional approaches, and those of Bayesian inspiration. The metamodels are tested on several problems characterized by the size of the training set, the input dimension, the quantile order and the value of the probability density function in the neighborhood of the quantile. The metamodels studied reveal good contrasts in our set of 480 experiments, enabling several patterns to be extracted. Based on our results, guidelines are proposed to allow users to select the best method for a given problem.
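As background for the quantile order mentioned above, the sketch below shows the standard pinball (check) loss, whose minimizer over a sample is an empirical quantile of that sample. It is a generic illustration and not one of the six metamodels in the review, which the abstract does not name; the function name `pinball_loss`, the sample, and the search grid are assumptions made for the example.

```python
import numpy as np

def pinball_loss(y, q, tau):
    """Average pinball loss of a candidate quantile q at order tau in (0, 1)."""
    r = y - q
    # Residuals above q are weighted by tau, residuals below q by (1 - tau).
    return np.mean(np.maximum(tau * r, (tau - 1.0) * r))

rng = np.random.default_rng(0)
y = rng.normal(size=2000)  # stand-in for noisy simulator outputs at a fixed input
tau = 0.9

# Order-statistics view: read the quantile directly off the sorted sample.
q_emp = np.quantile(y, tau)

# Loss-minimization view: search a grid for the value with the smallest pinball loss.
grid = np.linspace(-3.0, 3.0, 601)
losses = [pinball_loss(y, q, tau) for q in grid]
q_best = grid[int(np.argmin(losses))]

print("empirical quantile:", q_emp)
print("pinball minimizer on grid:", q_best)
```

The two estimates agree up to the grid resolution, which is why the same loss can be used to fit a conditional quantile when the candidate q is replaced by a function of the inputs.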
Thirty Years of Machine Learning: The Road to Pareto-Optimal Next-Generation Wireless Networks
Wang, Jingjing, Jiang, Chunxiao, Zhang, Haijun, Ren, Yong, Chen, Kwang-Cheng, Hanzo, Lajos
Next-generation wireless networks (NGWN) have substantial potential in terms of supporting a broad range of complex compelling applications both in military and civilian fields, where the users are able to enjoy high-rate, low-latency, low-cost and reliable information services. Achieving this ambitious goal requires new radio techniques for adaptive learning and intelligent decision making because of the complex heterogeneous nature of the network structures and wireless services. Machine learning algorithms have had great success in supporting big data analytics, efficient parameter estimation and interactive decision making. Hence, in this article, we review the thirty-year history of machine learning by elaborating on supervised learning, unsupervised learning, reinforcement learning and deep learning, respectively. Furthermore, we investigate their employment in the compelling applications of NGWNs, including heterogeneous networks (HetNets), cognitive radios (CR), Internet of things (IoT), machine to machine networks (M2M), and so on. This article aims to assist readers in clarifying the motivation and methodology of the various machine learning algorithms, so as to invoke them for hitherto unexplored services as well as scenarios of future wireless networks.