Regression
7 Types of Regression Techniques you should know
This article was posted by Sunil Ray. Sunil is a Business Analytics and Intelligence professional with deep experience in the Indian Insurance industry. Linear and Logistic regressions are usually the first algorithms people learn in predictive modeling. Due to their popularity, a lot of analysts even end up thinking that they are the only form of regressions. The ones who are slightly more involved think that they are the most important amongst all forms of regression analysis.
Distributed linear regression by averaging
Modern massive datasets pose an enormous computational burden to practitioners. Distributed computation has emerged as a universal approach to ease the burden: Datasets are partitioned over machines, which compute locally, and communicate short messages. Distributed data also arises due to privacy reasons, such as in medicine. It is important to study how to do statistical inference and machine learning in a distributed setting. In this paper, we study one-step parameter averaging in statistical linear models under data parallelism. We do linear regression on each machine, and take a weighted average of the parameters. How much do we lose compared to doing linear regression on the full data? Here we study the performance loss in estimation error, test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training data size. We discover several key phenomena. First, averaging is not optimal, and we find the exact performance loss. Our results are simple to use in practice. Second, different problems are affected differently by the distributed framework. Estimation error and confidence interval length increases a lot, while prediction error increases much less. These results match simulations and a data analysis example. We rely on recent results from random matrix theory, where we develop a new calculus of deterministic equivalents as a tool of broader interest.
Generalization Properties of hyper-RKHS and its Application to Out-of-Sample Extensions
Liu, Fanghui, Shi, Lei, Huang, Xiaolin, Yang, Jie, Suykens, Johan A. K.
Hyper-kernels endowed by hyper-Reproducing Kernel Hilbert Space (hyper-RKHS) formulate the kernel learning task as learning on the space of kernels itself, which provides significant model flexibility for kernel learning with outstanding performance in real-world applications. However, the convergence behavior of these learning algorithms in hyper-RKHS has not been investigated in learning theory. In this paper, we conduct approximation analysis of kernel ridge regression (KRR) and support vector regression (SVR) in this space. To the best of our knowledge, this is the first work to study the approximation performance of regression in hyper-RKHS. For applications, we propose a general kernel learning framework conducted by the introduced two regression models to deal with the out-of-sample extensions problem, i.e., to learn a underlying general kernel from the pre-given kernel/similarity matrix in hyper-RKHS. Experimental results on several benchmark datasets suggest that our methods are able to learn a general kernel function from an arbitrary given kernel matrix.
Understanding Deep Neural Networks from First Principles: Logistic Regression
The advanced feats we've seen machines do thus far have basically been examples of clever optimization techniques). So what does this learning process look like? First, weight and bias values are propagated forward through the model to arrive at a predicted output. At each neuron/node, the linear combination of the inputs is then multiplied by an activation function as described above-- the sigmoid function in our example. This process by which weights and biases are propagated from inputs to output is called forward propagation. After arriving at the predicted output, the loss for the training example is calculated.
Where did the least-square come from? โ Towards Data Science
Question: Why do you square the error in a regression machine learning task? Ans: "Why, of course, it turns out all the errors (residuals) into positive quantities!" Question: "OK, why not use a simpler absolute value function x to make all the errors positive?" Ans: "Aha, you are trying to trick me. Absolute value function is not differentiable everywhere!" Question: "That should not matter much for numerical algorithms. LASSO regression uses a term with absolute value and it can be handled.
A Gradient-Based Split Criterion for Highly Accurate and Transparent Model Trees
Broelemann, Klaus, Kasneci, Gjergji
Machine learning algorithms aim at minimizing the number of false decisions and increasing the accuracy of predictions. However, the high predictive power of advanced algorithms comes at the costs of transparency. State-of-the-art methods, such as neural networks and ensemble methods, often result in highly complex models that offer little transparency. We propose shallow model trees as a way to combine simple and highly transparent predictive models for higher predictive power without losing the transparency of the original models. We present a novel split criterion for model trees that allows for significantly higher predictive power than state-of-the-art model trees while maintaining the same level of simplicity. This novel approach finds split points which allow the underlying simple models to make better predictions on the corresponding data. In addition, we introduce multiple mechanisms to increase the transparency of the resulting trees.
Longitudinal data analysis using matrix completion
Kidziลski, ลukasz, Hastie, Trevor
In clinical practice and biomedical research, measurements are often collected sparsely and irregularly in time while the data acquisition is expensive and inconvenient. Examples include measurements of spine bone mineral density, cancer growth through mammography or biopsy, a progression of defect of vision, or assessment of gait in patients with neurological disorders. Since the data collection is often costly and inconvenient, estimation of progression from sparse observations is of great interest for practitioners. From the statistical standpoint, such data is often analyzed in the context of a mixed-effect model where time is treated as both random and fixed effect. Alternatively, researchers analyze Gaussian processes or functional data where observations are assumed to be drawn from a certain distribution of processes. These models are flexible but rely on probabilistic assumptions and require very careful implementation. In this study, we propose an alternative elementary framework for analyzing longitudinal data, relying on matrix completion. Our method yields point estimates of progression curves by iterative application of the SVD. Our framework covers multivariate longitudinal data, regression and can be easily extended to other settings. We apply our methods to understand trends of progression of motor impairment in children with Cerebral Palsy. Our model approximates individual progression curves and explains 30% of the variability. Low-rank representation of progression trends enables discovering that subtypes of Cerebral Palsy exhibit different progression trends.
Regression vs Curve Fitting - Technical Diversity in Data Science Teams
Modern enterprise data science teams are technically diverse. The members of these teams differ in their education levels (bachelors, masters, PhD), majors (CS, statistics, natural sciences), prior work experiences (advertising, actuaries, finance, experimental physics) and used tools (Excel, SAS, R, Python, Java). One is more likely to find biology or computational neuroscience majors in data science teams today than in actuarial or financial firms of the past. But there is a more subtle difference than the ones listed above: team members with quantitative backgrounds may differ in exposure to the same set of concepts. Take the example of regression. Both engineering and statistics departments devote a portion of their curriculum to teaching line fitting.
Refining Coarse-grained Spatial Data using Auxiliary Spatial Data Sets with Various Granularities
Tanaka, Yusuke, Iwata, Tomoharu, Tanaka, Toshiyuki, Kurashima, Takeshi, Okawa, Maya, Toda, Hiroyuki
We propose a probabilistic model for refining coarse-grained spatial data by utilizing auxiliary spatial data sets. Existing methods require that the spatial granularities of the auxiliary data sets are the same as the desired granularity of target data. The proposed model can effectively make use of auxiliary data sets with various granularities by hierarchically incorporating Gaussian processes. With the proposed model, a distribution for each auxiliary data set on the continuous space is modeled using a Gaussian process, where the representation of uncertainty considers the levels of granularity. The fine-grained target data are modeled by another Gaussian process that considers both the spatial correlation and the auxiliary data sets with their uncertainty. We integrate the Gaussian process with a spatial aggregation process that transforms the fine-grained target data into the coarse-grained target data, by which we can infer the fine-grained target Gaussian process from the coarse-grained data. Our model is designed such that the inference of model parameters based on the exact marginal likelihood is possible, in which the variables of fine-grained target and auxiliary data are analytically integrated out. Our experiments on real-world spatial data sets demonstrate the effectiveness of the proposed model.
Opacity, Obscurity, and the Geometry of Question-Asking
Boyce-Jacino, Christina, DeDeo, Simon
Asking questions is a pervasive human activity, but little is understood about what makes them difficult to answer. An analysis of a pair of large databases, of New York Times crosswords and questions from the quiz-show Jeopardy, establishes two orthogonal dimensions of question difficulty: obscurity (the rarity of the answer) and opacity (the indirectness of question cues, operationalized with word2vec). The importance of opacity, and the role of synergistic information in resolving it, suggests that accounts of difficulty in terms of prior expectations captures only a part of the question-asking process. A further regression analysis shows the presence of additional dimensions to question-asking: question complexity, the answer's local network density, cue intersection, and the presence of signal words. Our work shows how question-askers can help their interlocutors by using contextual cues, or, conversely, how a particular kind of unfamiliarity with the domain in question can make it harder for individuals to learn from others. Taken together, these results suggest how Bayesian models of question difficulty can be supplemented by process models and accounts of the heuristics individuals use to navigate conceptual spaces.