Regression
Bayesian Approximate Kernel Regression with Variable Selection
Crawford, Lorin, Wood, Kris C., Zhou, Xiang, Mukherjee, Sayan
Nonlinear kernel regression models are often used in statistics and machine learning because they are more accurate than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. In this paper, we propose a novel framework that provides an effect size analog of each explanatory variable for Bayesian kernel regression models when the kernel is shift-invariant --- for example, the Gaussian kernel. We use function analytic properties of shift-invariant reproducing kernel Hilbert spaces (RKHS) to define a linear vector space that: (i) captures nonlinear structure, and (ii) can be projected onto the original explanatory variables. The projection onto the original explanatory variables serves as an analog of effect sizes. The specific function analytic property we use is that shift-invariant kernel functions can be approximated via random Fourier bases. Based on the random Fourier expansion we propose a computationally efficient class of Bayesian approximate kernel regression (BAKR) models for both nonlinear regression and binary classification for which one can compute an analog of effect sizes. We illustrate the utility of BAKR by examining two important problems in statistical genetics: genomic selection (i.e. phenotypic prediction) and association mapping (i.e. inference of significant variants or loci). State-of-the-art methods for genomic selection and association mapping are based on kernel regression and linear models, respectively. BAKR is the first method that is competitive in both settings.
[P] Extracting input-to-output gradients from a Keras model • r/MachineLearning
Hi, so I am coming from a background in linear algebra and traditional numerical gradient-based optimization, but excited by the advancements that have been made in deep learning. To get my feet wet a bit, I made a pretty simple NN model to do some non-linear regressions for me. I uploaded my jupyter notebookit as a gist here (renders properly on github), which is pretty short and to the point. It just fits the 1D function y (x - 5)2 / 25. I know that Theano and Tensorflow are, at their core, graph based derivative (gradient) passing frameworks.
Collaborative Filtering with Side Information: a Gaussian Process Perspective
Kim, Hyunjik, Lu, Xiaoyu, Flaxman, Seth, Teh, Yee Whye
We tackle the problem of collaborative filtering (CF) with side information, through the lens of Gaussian Process (GP) regression. Driven by the idea of using the kernel to explicitly model user-item similarities, we formulate the GP in a way that allows the incorporation of low-rank matrix factorisation, arriving at our model, the Tucker Gaussian Process (TGP). Consequently, TGP generalises classical Bayesian matrix factorisation models, and goes beyond them to give a natural and elegant method for incorporating side information, giving enhanced predictive performance for CF problems. Moreover we show that it is a novel model for regression, especially well-suited to grid-structured data and problems where the dependence on covariates is close to being separable.
A budget-constrained inverse classification framework for smooth classifiers
Lash, Michael T., Lin, Qihang, Street, W. Nick, Robinson, Jennifer G.
Inverse classification is the process of manipulating an instance such that it is more likely to conform to a specific class. Past methods that address such a problem have shortcomings. Greedy methods make changes that are overly radical, often relying on data that is strictly discrete. Other methods rely on certain data points, the presence of which cannot be guaranteed. In this paper we propose a general framework and method that overcomes these and other limitations. The formulation of our method can use any differentiable classification function. We demonstrate the method by using logistic regression and Gaussian kernel SVMs. We constrain the inverse classification to occur on features that can actually be changed, each of which incurs an individual cost. We further subject such changes to fall within a certain level of cumulative change (budget). Our framework can also accommodate the estimation of (indirectly changeable) features whose values change as a consequence of actions taken. Furthermore, we propose two methods for specifying feature-value ranges that result in different algorithmic behavior. We apply our method, and a proposed sensitivity analysis-based benchmark method, to two freely available datasets: Student Performance from the UCI Machine Learning Repository and a real world cardiovascular disease dataset. The results obtained demonstrate the validity and benefits of our framework and method.
Top 20 Data Science MOOCs
Introduce yourself to the basics of data science and leave armed with practical experience extracting value from big data. This course teaches the basic techniques of data science, including both SQL and NoSQL solutions for massive data management (e.g., MapReduce and contemporaries), algorithms for data mining (e.g., clustering and association rule mining), and basic statistical modelling (e.g., linear and non-linear regression).
A Convex Framework for Fair Regression
Berk, Richard, Heidari, Hoda, Jabbari, Shahin, Joseph, Matthew, Kearns, Michael, Morgenstern, Jamie, Neel, Seth, Roth, Aaron
The widespread use of machine learning to make consequential decisions about individual citizens (including in domains such as credit, employment, education and criminal sentencing [3, 4, 26, 29]) has been accompanied by increased reports of instances in which the algorithms and models employed can be unfair or discriminatory in a variety of ways [2, 30]. As a result, research on fairness in machine learning and statistics has seen rapid growth in recent years [1, 5-7, 9-11, 13, 14, 18-21, 25, 27], and several mathematical formulations have been proposed as metrics of (un)fairness for a number of different learning frameworks. While much of the attention to date has focused on (binary) classification settings, where standard fairness notions include equal false positive or negative rates across different populations, less attention has been paid to fairness in (linear and logistic) regression settings, where the target and/or predicted values are continuous, and the same value may not occur even twice in the training data. In this work, we introduce a rich family of fairness metrics for regression models that take the form of a fairness regularizer and apply them to the standard loss functions for linear and logistic regression. Since these loss functions and our fairness regularizer are convex, the combined objective functions obtained from our framework are also convex, and thus permit efficient optimization. Furthermore, our family of fairness metrics covers the spectrum from the type of group fairness that is common in classification formulations (where e.g.
What is the Role of the Activation Function in a Neural Network?
Sorry if this is too trivial, but let me start at the "very beginning:" Linear regression. The goal of (ordinary least-squares) linear regression is to find the optimal weights that -- when linearly combined with the inputs -- result in a model that minimizes the vertical offsets between the target and explanatory variables, but let's not get distracted by model fitting, which is a different topic;). So, in linear regression, we compute a linear combination of weights and inputs (let's call this function the "net input function"). Next, let's consider logistic regression. Here, we put the net input z through a non-linear "activation function" -- the logistic sigmoid function where.
24 Uses of Statistical Modeling (Part I)
Here we discuss general applications of statistical models, whether they arise from data science, operations research, engineering, machine learning or statistics. We do not discuss specific algorithms such as decision trees, logistic regression, Bayesian modeling, Markov models, data reduction or feature selection. Instead, I discuss frameworks - each one using its own types of techniques and algorithms - to solve real life problems. Most of the entries below are found in Wikipedia, and I have used a few definitions or extracts from the relevant Wikipedia articles, in addition to personal contributions. Spatial dependency is the co-variation of properties within geographic space: characteristics at proximal locations appear to be correlated, either positively or negatively.
Is Regression Analysis Really Machine Learning?
That's a broad topic which has been treated many times. Much of what has been written on this topic is good, much is bad. But I find that the stats vs. machine learning argument, at that level, tends to focus on the forest at the cost of completely overlooking the trees. Shah's definitions, which I believe are reflective of many approaches, tend to focus on different ends of the respective spectrums of each of these concepts, treating machine learning as a practical activity and statistics as a theoretical abstraction (and, yes, I'm lumping "statistical modeling" together with "statistics" in this case... at least, for now). The relationship between statistics and machine learning is actually a highly complex one, and merely defining the 2 concepts is not helpful in dissecting this connection.
The Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square
Sur, Pragya, Chen, Yuxin, Candès, Emmanuel J.
Logistic regression is used thousands of times a day to fit data, predict future outcomes, and assess the statistical significance of explanatory variables. When used for the purpose of statistical inference, logistic models produce p-values for the regression coefficients by using an approximation to the distribution of the likelihood-ratio test. Indeed, Wilks' theorem asserts that whenever we have a fixed number $p$ of variables, twice the log-likelihood ratio (LLR) $2\Lambda$ is distributed as a $\chi^2_k$ variable in the limit of large sample sizes $n$; here, $k$ is the number of variables being tested. In this paper, we prove that when $p$ is not negligible compared to $n$, Wilks' theorem does not hold and that the chi-square approximation is grossly incorrect; in fact, this approximation produces p-values that are far too small (under the null hypothesis). Assume that $n$ and $p$ grow large in such a way that $p/n\rightarrow\kappa$ for some constant $\kappa < 1/2$. We prove that for a class of logistic models, the LLR converges to a rescaled chi-square, namely, $2\Lambda~\stackrel{\mathrm{d}}{\rightarrow}~\alpha(\kappa)\chi_k^2$, where the scaling factor $\alpha(\kappa)$ is greater than one as soon as the dimensionality ratio $\kappa$ is positive. Hence, the LLR is larger than classically assumed. For instance, when $\kappa=0.3$, $\alpha(\kappa)\approx1.5$. In general, we show how to compute the scaling factor by solving a nonlinear system of two equations with two unknowns. Our mathematical arguments are involved and use techniques from approximate message passing theory, non-asymptotic random matrix theory and convex geometry. We also complement our mathematical study by showing that the new limiting distribution is accurate for finite sample sizes. Finally, all the results from this paper extend to some other regression models such as the probit regression model.