Regression
An MM Algorithm for Split Feasibility Problems
Xu, Jason, Chi, Eric C., Yang, Meng, Lange, Kenneth
The classical multi-set split feasibility problem seeks a point in the intersection of finitely many closed convex domain constraints, whose image under a linear mapping also lies in the intersection of finitely many closed convex range constraints. Split feasibility generalizes important inverse problems including convex feasibility, linear complementarity, and regression with constraint sets. When a feasible point does not exist, solution methods that proceed by minimizing a proximity function can be used to obtain optimal approximate solutions to the problem. We present an extension of the proximity function approach that generalizes the linear split feasibility problem to allow for non-linear mappings. Our algorithm is based on the principle of majorization-minimization, is amenable to quasi-Newton acceleration, and comes complete with convergence guarantees under mild assumptions. Furthermore, we show that the Euclidean norm appearing in the proximity function of the non-linear split feasibility problem can be replaced by arbitrary Bregman divergences. We explore several examples illustrating the merits of non-linear formulations over the linear case, with a focus on optimization for intensity-modulated radiation therapy.
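For the linear case that the abstract takes as its starting point, one majorization-minimization iteration can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all weights are 1, the constraint sets are a box and a Euclidean ball chosen purely for the example, and the names `project_box`, `project_ball`, `proximity` and `mm_split_feasibility` are hypothetical. The surrogate replaces each squared distance dist(x, C)^2 by ||x - P_C(x_n)||^2, so each update reduces to a linear least-squares solve.

```python
import numpy as np

def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n."""
    return np.clip(x, lo, hi)

def project_ball(y, center, radius):
    """Euclidean projection onto a closed ball."""
    d = y - center
    norm = np.linalg.norm(d)
    return y if norm <= radius else center + radius * d / norm

def proximity(x, A, domain_projs, range_projs):
    """Half the sum of squared distances to all domain and range sets."""
    Ax = A @ x
    f = sum(np.sum((x - P(x)) ** 2) for P in domain_projs)
    f += sum(np.sum((Ax - Q(Ax)) ** 2) for Q in range_projs)
    return 0.5 * f

def mm_split_feasibility(A, x0, domain_projs, range_projs, iters=200):
    """MM iterations for the linear split feasibility proximity function."""
    n = A.shape[1]
    # With unit weights the system matrix does not change between iterations.
    H = len(domain_projs) * np.eye(n) + len(range_projs) * (A.T @ A)
    x = x0.astype(float)
    for _ in range(iters):
        Ax = A @ x
        rhs = sum(P(x) for P in domain_projs) + A.T @ sum(Q(Ax) for Q in range_projs)
        x = np.linalg.solve(H, rhs)  # minimizer of the quadratic surrogate
    return x

# Toy problem (assumed for illustration): x in [0, 1]^2, A x in a ball.
A = np.array([[1.0, 1.0], [1.0, -1.0]])
domain_projs = [lambda x: project_box(x, 0.0, 1.0)]
range_projs = [lambda y: project_ball(y, np.array([1.0, 0.0]), 0.5)]

x0 = np.array([3.0, -2.0])
f0 = proximity(x0, A, domain_projs, range_projs)
x_hat = mm_split_feasibility(A, x0, domain_projs, range_projs)
f_hat = proximity(x_hat, A, domain_projs, range_projs)
print(x_hat, f0, f_hat)
```

Because the surrogate majorizes the proximity function and touches it at the current iterate, each update cannot increase the proximity value. The non-linear and Bregman extensions described in the abstract are not covered by this sketch.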
TensorFlow Machine Learning Cookbook PACKT Books
TensorFlow is an open source software library for Machine Intelligence. The independent recipes in this book will teach you how to use TensorFlow for complex data computations and will help you gain more insights into your data than ever before. We'll start with the fundamentals of the TensorFlow library and you will learn about variables, matrices, and various data sources. Moving ahead, you will get hands-on experience of Linear Regression techniques with TensorFlow. The next chapters cover important high-level concepts such as neural networks, CNN, RNN, and NLP through real-world examples in every recipe.
Factor Analysis: Picking the Right Variables
In layman's terms, it means choosing which factors (variables) in a data set you should use for your model. In the above example, the columns (highlighted in light orange) would be our Factors. It can be very tempting, especially for new data science students, to want to include as many factors as possible. In fact, as you add more factors to a model, you will see many classic statistical markers for model goodness increase. This can give you a false sense of trust in the model.
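The effect described above can be reproduced with a small sketch. The data here are synthetic and every name (`r_squared`, `adjusted_r_squared`, the noise columns) is an assumption made for illustration. The example fits ordinary least squares with one informative factor, then again with 20 extra factors that are pure noise, and compares R-squared with adjusted R-squared.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjusted R^2, which penalizes the number of factors p."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
n = 60
x_real = rng.normal(size=(n, 1))              # one factor that drives y
y = 2.0 * x_real[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, 20))              # 20 factors unrelated to y

r2_small = r_squared(x_real, y)
r2_big = r_squared(np.hstack([x_real, noise]), y)
adj_small = adjusted_r_squared(r2_small, n, 1)
adj_big = adjusted_r_squared(r2_big, n, 21)

print(f"1 factor:   R^2={r2_small:.3f}  adjusted={adj_small:.3f}")
print(f"21 factors: R^2={r2_big:.3f}  adjusted={adj_big:.3f}")
```

On the training data, R-squared can never go down when factors are added to a least squares model, even when those factors are noise. Adjusted R-squared and held-out validation are two common ways to check whether an added factor is earning its place.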
Predicting patient 'cost blooms' in Denmark: a longitudinal population-based study
A small fraction of individuals account for the bulk of population healthcare expenditures in the USA, Denmark and other industrialised countries.1–4 Although many high-cost patients show consecutive high-cost years, the majority experience a 'cost bloom', or a surge in healthcare costs that propels them from a lower to the upper decile of population-level healthcare expenditures between consecutive years.4 Proactively identifying and managing care for high-cost patients--especially cost bloomers, who may disproportionately benefit from interventions to mitigate future high-cost years--can be an effective way to simultaneously improve quality and reduce population health costs.5–16 However, since the Centers for Medicare and Medicaid Services (CMS) commissioned the Society of Actuaries to compare leading prediction tools more than 10 years ago, scant progress has been made in improving cost-prediction tools.17 Overcoming these and other challenges associated with the management and care of high-cost patients is essential to achieving a higher value healthcare system.
Multivariate Regression with Grossly Corrupted Observations: A Robust Approach and its Applications
Zhang, Xiaowei, Xu, Chi, Zhang, Yu, Zhu, Tingshao, Cheng, Li
This paper studies the problem of multivariate linear regression where a portion of the observations is grossly corrupted or missing, and the magnitudes and locations of such occurrences are unknown a priori. To deal with this problem, we propose a new approach that explicitly considers the error source as well as its sparse nature. An interesting property of our approach is that it allows individual regression output elements or tasks to have their own noise levels. Moreover, despite working with a non-smooth optimization problem, our approach is still guaranteed to converge to its optimal solution. Experiments on synthetic data demonstrate the competitiveness of our approach compared with existing multivariate regression models. In addition, our approach has been validated empirically with very promising results on two exemplar real-world applications: the first concerns the prediction of Big-Five personality based on user behaviors at social network sites (SNSs), while the second is 3D human hand pose estimation from depth images. The implementation of our approach and comparison methods, as well as the datasets involved, are made publicly available in support of the open-source and reproducible research initiatives.
How to forecast using Regression Analysis in R
P-values for the coefficients of cylinders, horsepower and acceleration are all greater than 0.05. This means that the relationship between the dependent variable and these independent variables is not statistically significant at the 95% confidence level. I'll drop 2 of these variables and try again. High p-values for these independent variables do not mean that they definitely should not be used in the model. It could be that some other variables are correlated with these variables, which makes them less useful for prediction (check for multicollinearity).
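One common way to run the multicollinearity check mentioned above is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R^2). The sketch below is written in Python rather than R and uses synthetic columns, not the article's data set. The function name `vif` and the three columns are assumptions made for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other columns
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic predictors: b is almost a copy of a, c is independent of both.
rng = np.random.default_rng(1)
n = 200
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)
c = rng.normal(size=n)

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign that a predictor is largely explained by the others. In that situation, its coefficient can have a high p-value even though the variable carries real signal.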
Coupled Compound Poisson Factorization
Basbug, Mehmet E., Engelhardt, Barbara E.
We present a general framework, the coupled compound Poisson factorization (CCPF), to capture the missing-data mechanism in extremely sparse data sets by coupling a hierarchical Poisson factorization with an arbitrary data-generating model. We derive a stochastic variational inference algorithm for the resulting model and, as examples of our framework, implement three different data-generating models---a mixture model, linear regression, and factor analysis---to robustly model non-random missing data in the context of clustering, prediction, and matrix factorization. In all three cases, we test our framework against models that ignore the missing-data mechanism on large scale studies with non-random missing data, and we show that explicitly modeling the missing-data mechanism substantially improves the quality of the results, as measured using data log likelihood on a held-out test set.
ŷhat Five Common Applications of Data Science with Concrete, Real-Life Use Cases
In this whitepaper we introduce five common applications of data science that build upon that definition and goal. We debunk the impression that data science is some type of obscure black magic and give you concrete examples of how it is applied in reality. You'll learn how real companies are using data science to make their products and day-to-day operations better. Last but not least, we describe the data science life cycle and explain Yhat's role in getting models into production. Recommender systems, also known as recommender engines, are one of the most well known applications of data science.
Wavelet Scattering Regression of Quantum Chemical Energies
Hirn, Matthew, Mallat, Stéphane, Poilvert, Nicolas
We introduce multiscale invariant dictionaries to estimate quantum chemical energies of organic molecules, from training databases. Molecular energies are invariant to isometric atomic displacements, and are Lipschitz continuous to molecular deformations. Similarly to density functional theory (DFT), the molecule is represented by an electronic density function. A multiscale invariant dictionary is calculated with wavelet scattering invariants. It cascades a first wavelet transform which separates scales, with a second wavelet transform which computes interactions across scales. Sparse scattering regressions give state of the art results over two databases of organic planar molecules. On these databases, the regression error is of the order of the error produced by DFT codes, but at a fraction of the computational cost.
Shehroz Khan's answer to Is it possible to compute R-squared score in Weka for logistic regression? - Quora
The R-squared score is computed for regression problems. Logistic regression, despite its name, is not a regression method but a binary classification method. Therefore, the R-squared statistic cannot be computed for logistic regression. Other performance metrics, such as accuracy, precision, and recall, are more relevant in this context. To answer your question: no, the R-squared score is not a valid metric for logistic regression, whether you use Weka, any other ML library, or your own algorithm.
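The metrics named in the answer can be computed directly from the predicted and true labels. The sketch below is plain Python and does not use Weka. The function name `classification_metrics` and the label lists are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical labels and thresholded classifier outputs.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```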