Regression
Model Parameters and Hyperparameters in Machine Learning -- What is the difference? - WebSystemer.no
For example, suppose you want to build a simple linear regression model using an m-dimensional training data set. If the model uses the gradient descent algorithm to minimize the objective function in order to determine the weights w_0, w_1, w_2, โฆ,w_m, then we can have an optimizer such as GradientDescent(eta, n_iter). Here eta (learning rate) and n_iter (number of iterations) are the hyperparameters that would have to be adjusted in order to obtain the best values for the model parameters w_0, w_1, w_2, โฆ,w_m. For more information about this, see the following example: Machine Learning: Python Linear Regression Estimator Using Gradient Descent. Here, n_iter is the number of iterations, eta0 is the learning rate, and eed of the pseudo random number generator to use when shuffling the data.
Statistical Modeling -- The Full Pragmatic Guide
Continuing Our series of posts on how to interpret Machine Learning algorithms and predictions. Part 0 (optional) -- What is Data Science and the Data Scientist Part 1 -- Introduction to Interpretability Part 1.5 (optional) -- A Brief History of Statistics (May be useful to understand this post) Part 2 -- (this post) Interpreting models of high bias and low variance. Part 4 -- Is it possible to resolve the trade-off between bias and variance? Using Shapley to finally open the black box! In this post we will focus on the interpretation of high bias and low variance models, as we explained in the previous post, these algorithms are the easiest to interpret so assume several prerequisites in the data.
Sample Complexity of Learning Mixtures of Sparse Linear Regressions
Krishnamurthy, Akshay, Mazumdar, Arya, McGregor, Andrew, Pal, Soumyabrata
In the problem of learning mixtures of linear regressions, the goal is to learn a collection of signal vectors from a sequence of (possibly noisy) linear measurements, where each measurement is evaluated on an unknown signal drawn uniformly from this collection. This setting is quite expressive and has been studied both in terms of practical applications and for the sake of establishing theoretical guarantees. In this paper, we consider the case where the signal vectors are sparse; this generalizes the popular compressed sensing paradigm. We improve upon the state-of-the-art results as follows: In the noisy case, we resolve an open question of Yin et al. (IEEE Transactions on Information Theory, 2019) by showing how to handle collections of more than two vectors and present the first robust reconstruction algorithm, i.e., if the signals are not perfectly sparse, we still learn a good sparse approximation of the signals. In the noiseless case, as well as in the noisy case, we show how to circumvent the need for a restrictive assumption required in the previous work. Our techniques are quite different from those in the previous work: for the noiseless case, we rely on a property of sparse polynomials and for the noisy case, we provide new connections to learning Gaussian mixtures and use ideas from the theory of error-correcting codes.
Bounding Data-driven Model Errors in Power Grid Analysis
Liu, Yuxiao, Xu, Bolun, Botterud, Audun, Zhang, Ning, Kang, Chongqing
Data-driven models analyze power grids under incomplete physical information, and their accuracy has been mostly validated empirically using certain training and testing datasets. This paper explores error bounds for data-driven models under all possible training and testing scenarios, and proposes an evaluation implementation based on Rademacher complexity theory. We answer key questions for data-driven models: how much training data is required to guarantee a certain error bound, and how partial physical knowledge can be utilized to reduce the required amount of data. Our results are crucial for the evaluation and application of data-driven models in power grid analysis. We demonstrate the proposed method by finding generalization error bounds for two applications, i.e. branch flow linearization and external network equivalent under different degrees of physical knowledge. Results identify how the bounds decrease with additional power grid physical knowledge or more training data.
Differentially Private Bayesian Linear Regression
Bernstein, Garrett, Sheldon, Daniel
Linear regression is an important tool across many fields that work with sensitive human-sourced data. Significant prior work has focused on producing differentially private point estimates, which provide a privacy guarantee to individuals while still allowing modelers to draw insights from data by estimating regression coefficients. We investigate the problem of Bayesian linear regression, with the goal of computing posterior distributions that correctly quantify uncertainty given privately released statistics. We show that a naive approach that ignores the noise injected by the privacy mechanism does a poor job in realistic data settings. We then develop noise-aware methods that perform inference over the privacy mechanism and produce correct posteriors across a wide range of scenarios.
Learning Fair and Interpretable Representations via Linear Orthogonalization
He, Yuzi, Burghardt, Keith, Lerman, Kristina
To reduce human error and prejudice, many high-stakes decisions have been turned over to machine algorithms. However, recent research suggests that this does not remove discrimination, and can perpetuate harmful stereotypes. While algorithms have been developed to improve fairness, they typically face at least one of three shortcomings: they are not interpretable, they lose significant accuracy compared to unbiased equivalents, or they are not transferable across models. To address these issues, we propose a geometric method that removes correlations between data and any number of protected variables. Further, we can control the strength of debi-asing through an adjustable parameter to address the tradeoff between model accuracy and fairness. The resulting features are interpretable and can be used with many popular models, such as linear regression, random forest and multilayer perceptrons. The resulting predictions are found to be more accurate and fair than several comparable fair AI algorithms across a variety of benchmark datasets. Our work shows that debiasing data is a simple and effective solution toward improving fairness.
A First-Order Algorithmic Framework for Wasserstein Distributionally Robust Logistic Regression
Li, Jiajin, Huang, Sen, So, Anthony Man-Cho
Wasserstein distance-based distributionally robust optimization (DRO) has received much attention lately due to its ability to provide a robustness interpretation of various learning models. Moreover, many of the DRO problems that arise in the learning context admits exact convex reformulations and hence can be tackled by off-the-shelf solvers. Nevertheless, the use of such solvers severely limits the applicability of DRO in large-scale learning problems, as they often rely on general purpose interior-point algorithms. On the other hand, there are very few works that attempt to develop fast iterative methods to solve these DRO problems, which typically possess complicated structures. In this paper, we take a first step towards resolving the above difficulty by developing a first-order algorithmic framework for tackling a class of Wasserstein distance-based distributionally robust logistic regression (DRLR) problem. Specifically, we propose a novel linearized proximal ADMM to solve the DRLR problem, whose objective is convex but consists of a smooth term plus two non-separable non-smooth terms. We prove that our method enjoys a sublinear convergence rate. Furthermore, we conduct three different experiments to show its superb performance on both synthetic and real-world datasets. In particular, our method can achieve the same accuracy up to 800+ times faster than the standard off-the-shelf solver.
The Study of Machine Learning Models in Predicting the Intention of Adolescents to Smoke Cigarettes
Nam, Seung Joon, Kim, Han Min, Kang, Thomas, Park, Cheol Young
The use of electronic cigarette (e-cigarette) is increasing among adolescents. This is problematic since consuming nicotine at an early age can cause harmful effects in developing teenager's brain and health. Additionally, the use of e-cigarette has a possibility of leading to the use of cigarettes, which is more severe. There were many researches about e-cigarette and cigarette that mostly focused on finding and analyzing causes of smoking using conventional statistics. However, there is a lack of research on developing prediction models, which is more applicable to anti-smoking campaign, about e-cigarette and cigarette. In this paper, we research the prediction models that can be used to predict an individual e-cigarette user's (including non-e-cigarette users) intention to smoke cigarettes, so that one can be early informed about the risk of going down the path of smoking cigarettes. To construct the prediction models, five machine learning (ML) algorithms are exploited and tested for their accuracy in predicting the intention to smoke cigarettes among never smokers using data from the 2018 National Youth Tobacco Survey (NYTS). In our investigation, the Gradient Boosting Classifier, one of the prediction models, shows the highest accuracy out of all the other models. Also, with the best prediction model, we made a public website that enables users to input information to predict their intentions of smoking cigarettes.
Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification
Gao, Yingbo, Herold, Christian, Wang, Weiyue, Ney, Hermann
Prominently used in support vector machines and logistic regressions, kernel functions (kernels) can implicitly map data points into high dimensional spaces and make it easier to learn complex decision boundaries. In this work, by replacing the inner product function in the softmax layer, we explore the use of kernels for contextual word classification. In order to compare the individual kernels, experiments are conducted on standard language modeling and machine translation tasks. We observe a wide range of performances across different kernel settings. Extending the results, we look at the gradient properties, investigate various mixture strategies and examine the disambiguation abilities.
A Gentle Introduction to Logistic Regression With Maximum Likelihood Estimation
Logistic regression is a model for binary classification predictive modeling. The parameters of a logistic regression model can be estimated by the probabilistic framework called maximum likelihood estimation. Under this framework, a probability distribution for the target variable (class label) must be assumed and then a likelihood function defined that calculates the probability of observing the outcome given the input data and the model. This function can then be optimized to find the set of parameters that results in the largest sum likelihood over the training dataset. The maximum likelihood approach to fitting a logistic regression model both aids in better understanding the form of the logistic regression model and provides a template that can be used for fitting classification models more generally.