Regression
Boosted Sparse and Low-Rank Tensor Regression
He, Lifang, Chen, Kun, Xu, Wanwan, Zhou, Jiayu, Wang, Fei
We propose a sparse and low-rank tensor regression model to relate a univariate outcome to a feature tensor, in which each unit-rank tensor from the CP decomposition of the coefficient tensor is assumed to be sparse. This structure is both parsimonious and highly interpretable, as it implies that the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions. We take a divide-and-conquer strategy to simplify the task into a set of sparse unit-rank tensor regression problems. To make the computation efficient and scalable, for the unit-rank tensor regression, we propose a stagewise estimation procedure to efficiently trace out its entire solution path. We show that as the step size goes to zero, the stagewise solution paths converge exactly to those of the corresponding regularized regression. The superior performance of our approach is demonstrated on various real-world and synthetic examples.
Snap ML: A Hierarchical Framework for Machine Learning
Dünner, Celestine, Parnell, Thomas, Sarigiannis, Dimitrios, Ioannou, Nikolas, Anghel, Andreea, Ravi, Gummadi, Kandasamy, Madhusudanan, Pozidis, Haralampos
We describe a new software framework for fast training of generalized linear models. The framework, named Snap Machine Learning (Snap ML), combines recent advances in machine learning systems and algorithms in a nested manner to reflect the hierarchical architecture of modern computing systems. We prove theoretically that such a hierarchical system can accelerate training in distributed environments where intra-node communication is cheaper than inter-node communication. Additionally, we provide a review of the implementation of Snap ML in terms of GPU acceleration, pipelining, communication patterns and software architecture, highlighting aspects that were critical for achieving high performance. We evaluate the performance of Snap ML in both single-node and multi-node environments, quantifying the benefit of the hierarchical scheme and the data streaming functionality, and comparing with other widely-used machine learning software frameworks. Finally, we present a logistic regression benchmark on the Criteo Terabyte Click Logs dataset and show that Snap ML achieves the same test loss an order of magnitude faster than any of the previously reported results, including those obtained using TensorFlow and scikit-learn.
Linear Regression in Python - Venkatesh Prabhu - Medium
How great would it be if you could predict the number of cars your company might sell next year? How profitable might you be if you could predict a brand's stock price in advance, so that you could invest without any risks? How awesome would it be if you could predict your salary for the next 5 years? How cool would it be if you could predict the score for your favourite game? All of these are possible with just one simple algorithm called Linear Regression.
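As a rough illustration of the idea, the sketch below fits a straight line by ordinary least squares using only the Python standard library. The function names (`fit_line`, `predict`) and the salary figures are hypothetical and chosen for illustration; they do not come from the article.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for the model y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(model, x):
    """Evaluate the fitted line at a new x value."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]

model = fit_line(years, salary)
print("intercept, slope:", model)
print("predicted salary at year 6:", predict(model, 6))
```

The toy data are perfectly linear, so the fitted line reproduces them exactly; real data would scatter around the line, and the prediction would carry uncertainty.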
Supervised Learning: Model Popularity from Past to Present
The field of machine learning has gone through enormous changes in the last decades. Admittedly, there are some methods that have been around for a long time and are still a staple of the field. For example, the concept of least squares was already proposed in the early 19th century by Legendre and Gauss. Other approaches such as neural networks, whose most basic form was introduced in 1958, were substantially advanced in the last decades, while other methods such as support vector machines (SVMs) are even more recent. Due to the large number of available approaches for supervised learning, the following question is often posed: What is the best model?
Predicting with Proxies
Predictive analytics is increasingly used to guide decision-making in many applications. However, in practice, we often have limited data on the true predictive task of interest, but copious data on a closely-related proxy predictive task. Practitioners often train predictive models on proxies since it achieves more accurate predictions. For example, e-commerce platforms use abundant customer click data (proxy) to make product recommendations rather than the relatively sparse customer purchase data (true outcome of interest); alternatively, hospitals often rely on medical risk scores trained on a different patient population (proxy) rather than their own patient population (true cohort of interest) to assign interventions. However, not accounting for the bias in the proxy can lead to sub-optimal decisions. Using real datasets, we find that this bias can often be captured by a sparse function of the features. Thus, we propose a novel two-step estimator that uses techniques from high-dimensional statistics to efficiently combine a large amount of proxy data and a small amount of true data. We prove upper bounds on the error of our proposed estimator and lower bounds on several heuristics commonly used by data scientists; in particular, our proposed estimator can achieve the same accuracy with exponentially less true data (in the number of features $d$). Our proof relies on a new tail inequality on the convergence of LASSO for approximately sparse vectors. Finally, we demonstrate the effectiveness of our approach on e-commerce and healthcare datasets; in both cases, we achieve significantly better predictive accuracy as well as managerial insights into the nature of the bias in the proxy data.
On the Interaction Effects Between Prediction and Clustering
Barnes, Matt, Dubrawski, Artur
Machine learning systems increasingly depend on pipelines of multiple algorithms to provide high quality and well structured predictions. This paper argues interaction effects between clustering and prediction (e.g. classification, regression) algorithms can cause subtle adverse behaviors during cross-validation that may not be initially apparent. In particular, we focus on the problem of estimating the out-of-cluster (OOC) prediction loss given an approximate clustering with probabilistic error rate $p_0$. Traditional cross-validation techniques exhibit significant empirical bias in this setting, and the few attempts to estimate and correct for these effects are intractable on larger datasets. Further, no previous work has been able to characterize the conditions under which these empirical effects occur, and if they do, what properties they have. We precisely answer these questions by providing theoretical properties which hold in various settings, and prove that expected out-of-cluster loss behavior rapidly decays with even minor clustering errors. Fortunately, we are able to leverage these same properties to construct hypothesis tests and scalable estimators necessary for correcting the problem. Empirical results on benchmark datasets validate our theoretical results and demonstrate how scaling techniques provide solutions to new classes of problems.
Visualizing and assessing discrimination in the logistic regression model. - PubMed - NCBI
Logistic regression models are widely used in medicine for predicting patient outcome (prognosis) and constructing diagnostic tests (diagnosis). Multivariable logistic models yield an (approximately) continuous risk score, a transformation of which gives the estimated event probability for an individual. A key aspect of model performance is discrimination, that is, the model's ability to distinguish between patients who have (or will have) an event of interest and those who do not (or will not). Graphical aids are important in understanding a logistic model. The receiver-operating characteristic (ROC) curve is familiar, but not necessarily easy to interpret. We advocate a simple graphic that provides further insight into discrimination, namely a histogram or dot plot of the risk score in the outcome groups.
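The graphic described above can be sketched as a pair of text histograms of the risk score, one per outcome group. This is a minimal sketch under stated assumptions: the scores and outcomes are made up, the helper names (`event_probability`, `histogram_counts`, `print_group`) are hypothetical, and a real analysis would use the fitted model's linear predictor and a plotting library.

```python
import math


def event_probability(risk_score):
    """Logistic transformation of a risk score (the linear predictor)."""
    return 1.0 / (1.0 + math.exp(-risk_score))


def histogram_counts(scores, lo, hi, bins):
    """Count scores in equal-width bins over [lo, hi]; outliers are clamped."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in scores:
        idx = int((s - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp to the first or last bin
        counts[idx] += 1
    return counts


def print_group(label, counts, lo, hi):
    """Print one text histogram, one row per bin."""
    width = (hi - lo) / len(counts)
    print(label)
    for i, c in enumerate(counts):
        left = lo + i * width
        print(f"  [{left:5.1f}, {left + width:5.1f}) " + "*" * c)


# Hypothetical risk scores and observed outcomes (1 = event, 0 = no event).
scores = [-2.1, -1.4, -0.9, -0.3, 0.2, 0.4, 1.1, 1.8, 2.5]
outcomes = [0, 0, 0, 0, 1, 0, 1, 1, 1]

non_event_scores = [s for s, y in zip(scores, outcomes) if y == 0]
event_scores = [s for s, y in zip(scores, outcomes) if y == 1]

non_event_counts = histogram_counts(non_event_scores, -3.0, 3.0, 6)
event_counts = histogram_counts(event_scores, -3.0, 3.0, 6)

print_group("No event", non_event_counts, -3.0, 3.0)
print_group("Event", event_counts, -3.0, 3.0)
```

The less the two histograms overlap, the better the model separates patients with and without the event.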
Multi-task Prediction of Patient Workload
Olya, Mohammad Hessam, Zhu, Dongxiao, Yang, Kai
Developing reliable workload prediction models can affect many aspects of clinical decision making. The primary challenge in healthcare systems is handling demand uncertainty over time. This issue becomes more critical for healthcare facilities that treat chronic diseases, because such patients need continuous treatment over time. Although some researchers have recently explored methods for workload prediction, few studies have focused on forecasting a quantitative measure of healthcare providers' workload. Moreover, most of these studies considered workload prediction within a single facility. A drawback of the previous studies is that the problem has not been investigated across multiple facilities, where the quality of service, the equipment and resources used, and the diagnosis and treatment procedures may differ even for patients with similar conditions. To address this issue, this paper proposes a framework for patient workload prediction using patient data from VA facilities across the US. To capture information from patients with similar attributes and make the predictions more accurate, the paper develops a heuristic cluster-based algorithm for single-task learning as well as a multi-task learning approach.
A Descriptive Study of Variable Discretization and Cost-Sensitive Logistic Regression on Imbalanced Credit Data
Zhang, Lili, Ray, Herman, Tan, Soon
Training classification models on imbalanced data sets tends to result in bias towards the majority class. In this paper, we demonstrate how the variable discretization and Cost-Sensitive Logistic Regression help mitigate this bias on an imbalanced credit scoring data set. 10-fold cross-validation is used as the evaluation method, and the performance measurements are ROC curves and the associated Area Under the Curve. The results show that good variable discretization and Cost-Sensitive Logistic Regression with the best class weight can reduce the model bias and/or variance. It is also shown that effective variable selection helps reduce the model variance. From the algorithm perspective, Cost-Sensitive Logistic Regression is beneficial for increasing the prediction ability of predictors even if they are not in their best forms and keeping the multivariate effect and univariate effect of predictors consistent. From the predictors perspective, the variable discretization performs slightly better than Cost-Sensitive Logistic Regression, provides more reasonable coefficient estimates for predictors which have nonlinear relationship against their empirical logit, and is robust to penalty weights of misclassifications of events and non-events determined by their proportions.
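To make the cost-sensitive idea concrete, the sketch below fits a one-feature logistic regression by gradient descent on a class-weighted log loss. It is an illustration under stated assumptions, not the paper's procedure: the data are made up, the function names are hypothetical, and the weights use the common inverse-frequency heuristic (n / (2 * class count)) rather than the tuned "best class weight" the abstract refers to.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def balanced_class_weights(ys):
    """Inverse-frequency weights: n / (2 * count of each class)."""
    n = len(ys)
    n_pos = sum(ys)
    n_neg = n - n_pos
    return {0: n / (2 * n_neg), 1: n / (2 * n_pos)}


def weighted_log_loss(params, xs, ys, weights):
    """Class-weighted log loss, normalized by the total weight."""
    b, w = params
    total = norm = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b + w * x)
        c = weights[y]  # misclassification cost for this example's class
        total += -c * (y * math.log(p) + (1 - y) * math.log(1 - p))
        norm += c
    return total / norm


def fit(xs, ys, weights, lr=0.5, steps=2000):
    """Plain gradient descent on the weighted log loss, starting from zero."""
    b, w = 0.0, 0.0
    norm = sum(weights[y] for y in ys)
    for _ in range(steps):
        grad_b = grad_w = 0.0
        for x, y in zip(xs, ys):
            err = weights[y] * (sigmoid(b + w * x) - y)
            grad_b += err
            grad_w += err * x
        b -= lr * grad_b / norm
        w -= lr * grad_w / norm
    return b, w


# Hypothetical imbalanced data: 8 non-events and 2 events.
xs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

weights = balanced_class_weights(ys)
initial_loss = weighted_log_loss((0.0, 0.0), xs, ys, weights)
params = fit(xs, ys, weights)
final_loss = weighted_log_loss(params, xs, ys, weights)

print("class weights:", weights)
print("weighted loss before and after fitting:", initial_loss, final_loss)
```

With these weights each class contributes the same total weight to the loss, so errors on the rare event class are no longer swamped by the majority class. Setting both weights to 1 recovers ordinary logistic regression.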
Classification of radiology reports by modality and anatomy: A comparative study
Bendersky, Marina, Wu, Joy, Syeda-Mahmood, Tanveer
Abstract: Data labeling is currently a time-consuming task that often requires expert knowledge. In research settings, the availability of correctly labeled data is crucial to ensure that model predictions are accurate and useful. We propose relatively simple machine learning-based models that achieve high performance metrics in the binary and multiclass classification of radiology reports. We compare the performance of these algorithms to that of a data-driven approach based on NLP, and find that the logistic regression classifier outperforms all other models in both the binary and multiclass classification tasks. We then choose the logistic regression binary classifier to predict chest X-ray (CXR) / non-chest X-ray (non-CXR) labels in reports from different datasets, unseen during any training phase of any of the models. Even in unseen report collections, the binary logistic regression classifier achieves average precision values above 0.9. Based on the regression coefficient values, we also identify frequent tokens in CXR and non-CXR reports that are features with possibly high predictive power.
I. INTRODUCTION
Large data collections, which can consist of text, images or even video, are becoming more easily available to researchers, clinicians and the public in general. It is quite often necessary, as a critical initial step, to mine input data before proceeding to further research or analysis. In a research setting, careful and accurate data labeling can be a tedious and time-consuming task that often requires manual input and expert knowledge. Moreover, the same dataset might need to be relabeled multiple times, not only when it is used for different research purposes but also when the data is mislabeled.
Mislabeled data [1] itself creates at least two new problems: first, the mislabeled data needs to be identified and differentiated from correctly labeled data [1, 2], and second, the mislabeled data should be corrected or removed from the dataset (if possible).