Regression
How Fast Will You Get a Response? Predicting Interval Time for Reciprocal Link Creation
Dave, Vachik S. (Indiana University-Purdue University Indianapolis) | Hasan, Mohammad Al (Indiana University-Purdue University Indianapolis) | Reddy, Chandan K. (Virginia Polytechnic Institute and State University)
In recent years, reciprocal link prediction has received some attention from data mining and social network analysis researchers, who have treated it as a binary classification task. However, it is also important to predict the interval time until a reciprocal link is created. This is a challenging problem for two reasons. First, effective features are lacking: well-known link prediction features are designed for undirected networks and for binary classification, so they do not work well for interval time prediction. Second, the presence of censored data instances makes traditional supervised regression methods unsuitable. In this paper, we propose a solution for the reciprocal link interval time prediction task. We map the problem into a survival analysis framework and show through extensive experiments on real-world datasets that survival analysis methods perform better than traditional regression, a neural-network-based model, and support vector regression (SVR).
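To illustrate why censored instances call for survival analysis rather than plain regression, here is a minimal sketch of the Kaplan-Meier estimator in plain Python. The abstract does not say which survival methods the authors use, so this is a stand-in example only; the function name `kaplan_meier` and the toy data are assumptions made for illustration.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time where an event occurred.

    durations: time until the reciprocal link appeared, or until
               observation stopped (hypothetical units)
    observed:  True if the reciprocal link was seen, False if censored
    """
    samples = sorted(zip(durations, observed))
    at_risk = len(samples)   # instances still being followed
    survival = 1.0           # estimated P(no reciprocal link yet)
    curve = []
    i = 0
    while i < len(samples):
        t = samples[i][0]
        events = 0
        removed = 0
        # Group all instances that share the same time t.
        while i < len(samples) and samples[i][0] == t:
            events += samples[i][1]   # True counts as 1, censored as 0
            removed += 1
            i += 1
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        # Censored instances leave the risk set without counting as events.
        at_risk -= removed
    return curve


# Hypothetical toy data: three of the six links are censored.
durations = [2, 3, 3, 5, 8, 8]
observed = [True, True, False, True, False, False]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"t={t}: S(t)={s:.4f}")
```

Censored instances still contribute to the risk set up to the time they drop out, which is information a regression model fitted only to observed intervals would discard.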
Predicting Breast Cancer Using Apache Spark Machine Learning Logistic Regression
In this blog post, I'll help you get started using Apache Spark's spark.ml. Classification is a family of supervised machine learning algorithms that identify which category an item belongs to (for example, whether a cancer tissue observation is malignant or not), based on labeled examples of known items (for example, observations known to be malignant or not). Classification takes a set of data with known labels and pre-determined features and learns how to label new records based on that information. Features are the "if questions" that you ask. The label is the answer to those questions.
Machine Learning - 3 Things You Need to Know - MATLAB & Simulink
Supervised machine learning builds a model that makes predictions based on evidence in the presence of uncertainty. A supervised learning algorithm takes a known set of input data and known responses to the data (output) and trains a model to generate reasonable predictions for the response to new data. Use supervised learning if you have known data for the output you are trying to predict. Supervised learning uses classification and regression techniques to develop predictive models. Classification techniques predict discrete responses--for example, whether an email is genuine or spam, or whether a tumor is cancerous or benign.
Automatic Response Category Combination in Multinomial Logistic Regression
Price, Bradley S., Geyer, Charles J., Rothman, Adam J.
We propose a penalized likelihood method that simultaneously fits the multinomial logistic regression model and combines subsets of the response categories. The penalty is nondifferentiable when pairs of columns in the optimization variable are equal. This encourages pairwise equality of these columns in the estimator, which corresponds to response category combination. We use an alternating direction method of multipliers algorithm to compute the estimator, and we discuss the algorithm's convergence. Prediction and model selection are also addressed.
Building Regression Models in R using Support Vector Regression
The article studies the advantage of Support Vector Regression (SVR) over Simple Linear Regression (SLR) models. SVR uses the same basic idea as Support Vector Machine (SVM), a classification algorithm, but applies it to predict real values rather than a class. SVR accounts for non-linearity in the data and provides a proficient prediction model. Along with a thorough understanding of SVR, we also give the reader hands-on experience of building the model in R. We perform SLR and SVR on the same dataset and compare the results. The article is organized as follows: Section 1 provides a quick review of SLR and its implementation in R. Section 2 discusses the theoretical aspects of SVR and the steps to fit SVR in R. It also covers the basics of tuning an SVR model.
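One way to see how SVR differs from least-squares regression is through the loss each one minimizes. The sketch below, written in Python rather than the article's R, compares the squared loss with the epsilon-insensitive loss commonly used by SVR. The function names and the value `epsilon=0.5` are assumptions chosen for illustration, not taken from the article.

```python
def squared_loss(y_true, y_pred):
    """Loss minimized by ordinary least-squares regression."""
    return (y_true - y_pred) ** 2


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    """Loss commonly used by SVR: errors inside the epsilon tube cost nothing."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical (target, prediction) pairs.
for y_true, y_pred in [(3.0, 3.2), (3.0, 4.0), (3.0, 6.0)]:
    print(
        y_true,
        y_pred,
        squared_loss(y_true, y_pred),
        epsilon_insensitive_loss(y_true, y_pred),
    )
```

Small errors are ignored entirely and large errors grow linearly rather than quadratically, so a few outliers pull the fitted function less than they would under squared loss.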
Finding Bottlenecks: Predicting Student Attrition with Unsupervised Classifier
Sajjadi, Seyed, Shapiro, Bruce, McKinlay, Christopher, Sarkisyan, Allen, Shubin, Carol, Osoba, Efunwande
Policy makers, the public, university administrators, students and their families are concerned about low graduation rates and lengthy times to degree in higher education. The median time to graduation is six years at CSUN (1). The four-year and the six-year graduation rates are 13% and 50%, respectively (2). With an enrollment of over 6000 undergraduate students, CoBaE is one of the largest business schools in the nation. CoBaE confers the second most undergraduate degrees at CSUN (behind the College of Social and Behavioral Science), and it has three of the top ten most popular majors (Management, Finance, and Marketing) at CSUN.
Measuring the non-asymptotic convergence of sequential Monte Carlo samplers using probabilistic programming
Cusumano-Towner, Marco F., Mansinghka, Vikash K.
A key limitation of sampling algorithms for approximate inference is that it is difficult to quantify their approximation error. Widely used sampling schemes, such as sequential importance sampling with resampling and Metropolis-Hastings, produce output samples drawn from a distribution that may be far from the target posterior distribution. This paper shows how to upper-bound the symmetric KL divergence between the output distribution of a broad class of sequential Monte Carlo (SMC) samplers and their target posterior distributions, subject to assumptions about the accuracy of a separate gold-standard sampler. The proposed method applies to samplers that combine multiple particles, multinomial resampling, and rejuvenation kernels. The experiments show the technique being used to estimate bounds on the divergence of SMC samplers for posterior inference in a Bayesian linear regression model and a Dirichlet process mixture model. This paper builds on a growing body of work begun by [1] and [2] into estimating upper bounds on KL divergences between a sampler's output distribution and the posterior. In variational inference, the KL divergence of the variational approximation is the gap between the variational lower bound and the log-evidence.
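For readers unfamiliar with the quantity being bounded, the following minimal sketch computes the symmetric KL divergence between two discrete distributions. It only illustrates the definition; it is not the paper's estimator, which targets sampler output distributions that cannot be enumerated like this. The example distributions are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetric KL divergence: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)


# Hypothetical sampler output distribution and target posterior.
sampler_output = [0.5, 0.3, 0.2]
target = [0.4, 0.4, 0.2]
print(symmetric_kl(sampler_output, target))
```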
Logistic Regression Example in Python (Source Code Included)
It's been a long time since I did a coding demonstration, so I thought I'd put one up to provide you a logistic regression example in Python! Admittedly, this is a cliff notes version, but I hope you'll get enough from what I have put up here to at least feel comfortable with the mechanics of doing logistic regression in Python (more specifically, using scikit-learn, pandas, etc.). This logistic regression example in Python will predict passenger survival using the Titanic dataset from Kaggle. Before launching into the code though, let me give you a tiny bit of theory behind logistic regression. The logistic regression formula is derived from the standard linear equation for a straight line.
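As a rough illustration of that last point, the sketch below takes the linear expression z = bias + w·x and passes it through the logistic (sigmoid) function to obtain a probability. It uses only the standard library, not scikit-learn; the feature values, weights, and bias are made-up numbers, not coefficients fitted to the Titanic data.

```python
import math


def predict_probability(features, weights, bias):
    """Logistic regression prediction: sigmoid of a linear combination."""
    # The familiar straight-line form, generalized to several features.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # The sigmoid squashes z from (-inf, inf) into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical passenger: [is_female, passenger_class]
passenger = [1.0, 3.0]
weights = [2.5, -1.0]   # assumed values, for illustration only
bias = 1.0
print(predict_probability(passenger, weights, bias))
```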
Linear Regression with Shuffled Labels
Abid, Abubakar, Poon, Ada, Zou, James
Since at least the 19th century, linear regression has been widely used in statistics to infer the relationship between one or more explanatory variables (or input features) and a continuous dependent variable (or label) [1, 2]. In the classical setting, linear regression is used on supervised datasets that are fully and individually labeled. Not all data fit this criterion, so, in recent years, the question of inference from weakly-supervised datasets has drawn attention in the machine learning community [3, 4, 5]. In weakly-supervised datasets, data are neither entirely labeled nor entirely unlabeled; a subset of the data may be labeled, as is the case in semi-supervised learning, or the data may be implicitly labeled, as occurs, for example, in multi-instance learning [6, 7]. Weakly-supervised datasets naturally arise in situations where obtaining labels for individual data is expensive or difficult; often, it is significantly easier to conduct experiments that provide partial information. In this paper, we study one specific case of weakly-supervised data: shuffled data, in which all of the labels are observed, but the mutual ordering between the input features and the labels is unknown. Shuffled linear regression, then, can be described as a variant of traditional linear regression in which the labels are additionally perturbed by an unknown permutation.
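The problem statement can be made concrete with a tiny brute-force sketch: try every ordering of the shuffled labels, fit a least-squares slope for each, and keep the ordering with the smallest residual. This is an illustration of the problem definition only, not the authors' algorithm, and it is feasible only for a handful of points because the number of permutations grows factorially. The one-dimensional, no-intercept model and the data values are assumptions.

```python
from itertools import permutations


def fit_slope(xs, ys):
    """Least-squares slope for the model y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def shuffled_regression(xs, shuffled_ys):
    """Search all label orderings; return (residual, slope, ordering) of the best."""
    best = None
    for perm in permutations(shuffled_ys):
        w = fit_slope(xs, perm)
        residual = sum((y - w * x) ** 2 for x, y in zip(xs, perm))
        if best is None or residual < best[0]:
            best = (residual, w, perm)
    return best


# Hypothetical noise-free data generated with slope 2, then shuffled.
xs = [1.0, 2.0, 3.0, 4.0]
shuffled_ys = [6.0, 2.0, 8.0, 4.0]
residual, w, ordering = shuffled_regression(xs, shuffled_ys)
print(w, ordering, residual)
```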
Efficient Spatio-Temporal Gaussian Regression via Kalman Filtering
Todescato, Marco, Carron, Andrea, Carli, Ruggero, Pillonetto, Gianluigi, Schenato, Luca
In this work we study the non-parametric reconstruction of spatio-temporal dynamical Gaussian processes (GPs) via GP regression from sparse and noisy data. GPs have been mainly applied to spatial regression, where they represent one of the most powerful estimation approaches, thanks in part to their universal representing properties. Their extension to dynamical processes has so far been elusive, since classical implementations lead to unscalable algorithms. We propose a novel procedure to address this problem by coupling GP regression and Kalman filtering. In particular, assuming space/time separability of the covariance (kernel) of the process and rational time spectrum, we build a finite-dimensional discrete-time state-space process representation amenable to Kalman filtering. With sampling over a finite set of fixed spatial locations, our major finding is that the Kalman filter state at instant $t_k$ represents a sufficient statistic to compute the minimum variance estimate of the process at any $t \geq t_k$ over the entire spatial domain. This result can be interpreted as a novel Kalman representer theorem for dynamical GPs. We then extend the study to situations where the set of spatial input locations can vary over time. The proposed algorithms are finally tested on both synthetic and real field data, also providing comparisons with standard GP and truncated GP regression techniques.
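For orientation, here is a minimal sketch of one predict/update cycle of a scalar Kalman filter, the building block the abstract couples with GP regression. It is a textbook one-dimensional filter, not the paper's spatio-temporal construction; the state model x_k = a * x_{k-1} + noise and the values of `a`, `q`, and `r` are assumptions chosen for illustration.

```python
def kalman_step(mean, var, measurement, a=1.0, q=0.01, r=0.25):
    """One predict/update cycle of a scalar Kalman filter.

    a: state transition coefficient (assumed)
    q: process noise variance (assumed)
    r: measurement noise variance (assumed)
    """
    # Predict: propagate the state estimate through the dynamics.
    pred_mean = a * mean
    pred_var = a * a * var + q
    # Update: blend the prediction with the noisy measurement.
    gain = pred_var / (pred_var + r)
    new_mean = pred_mean + gain * (measurement - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var


# Hypothetical noisy measurements of a slowly varying quantity.
mean, var = 0.0, 1.0
for y in [1.0, 0.9, 1.1, 1.0]:
    mean, var = kalman_step(mean, var, y)
    print(f"estimate={mean:.3f}, variance={var:.3f}")
```

The filter keeps only the current mean and variance, so each step has constant cost however many past measurements there are.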