Regression
A Conditional Randomization Test for Sparse Logistic Regression in High-Dimension
Nguyen, Binh T., Thirion, Bertrand, Arlot, Sylvain
Identifying the relevant variables for a classification model with correct confidence levels is a central but difficult task in high-dimension. Despite the core role of sparse logistic regression in statistics and machine learning, it still lacks a good solution for accurate inference in the regime where the number of features $p$ is as large as or larger than the number of samples $n$. Here, we tackle this problem by improving the Conditional Randomization Test (CRT). The original CRT algorithm shows promise as a way to output p-values while making few assumptions on the distribution of the test statistics. As it comes with a prohibitive computational cost even in mildly high-dimensional problems, faster solutions based on distillation have been proposed. Yet, they rely on unrealistic hypotheses and result in low-power solutions. To improve this, we propose \emph{CRT-logit}, an algorithm that combines a variable-distillation step and a decorrelation step that takes into account the geometry of the $\ell_1$-penalized logistic regression problem. We provide a theoretical analysis of this procedure, and demonstrate its effectiveness on simulations, along with experiments on large-scale brain-imaging and genomics datasets.
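As a rough illustration of the original (non-distilled) CRT idea mentioned above, the following sketch computes a p-value for one feature by resampling it from its conditional distribution given the other features. This is not CRT-logit. It assumes the rows of X are Gaussian with a known covariance `Sigma`, and it uses a deliberately cheap, illustrative test statistic; the function name `crt_pvalue` and all simulation settings are hypothetical.

```python
import numpy as np

def crt_pvalue(X, y, j, Sigma, n_draws=200, seed=0):
    """Vanilla CRT p-value for feature j, assuming rows of X ~ N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    others = np.delete(np.arange(X.shape[1]), j)
    # Gaussian conditional law of X_j given the remaining columns.
    coef = np.linalg.solve(Sigma[np.ix_(others, others)], Sigma[others, j])
    cond_mean = X[:, others] @ coef
    cond_std = np.sqrt(Sigma[j, j] - Sigma[j, others] @ coef)
    resid = y - y.mean()

    def stat(col):
        # Illustrative statistic (an assumption): |covariance-like score with y|.
        return abs(col @ resid)

    t_obs = stat(X[:, j])
    # Recompute the statistic on fresh draws of X_j that ignore y.
    t_null = np.array([
        stat(cond_mean + cond_std * rng.standard_normal(X.shape[0]))
        for _ in range(n_draws)
    ])
    return (1 + np.sum(t_null >= t_obs)) / (n_draws + 1)

# Synthetic data: only feature 0 influences the binary response.
rng = np.random.default_rng(1)
n, p, rho = 300, 5, 0.3
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
prob = 1 / (1 + np.exp(-2.0 * X[:, 0]))
y = (rng.random(n) < prob).astype(float)

p_relevant = crt_pvalue(X, y, 0, Sigma)  # feature that drives y
p_null = crt_pvalue(X, y, 4, Sigma)      # feature with no direct effect
print(p_relevant, p_null)
```

Each p-value needs `n_draws` recomputations of the statistic per feature, which hints at why the cost becomes prohibitive when the statistic itself requires fitting a penalized model.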
Modeling the dynamics of language change: logistic regression, Piotrowski's law, and a handful of examples in Polish
Górski, Rafał L., Eder, Maciej
The study discusses modeling diachronic processes by logistic regression. The phenomenon of nonlinear changes in language was first observed by Raimund Piotrowski (hence labelled as Piotrowski's law), even if actual linguistic evidence usually speaks against using the notion of a "law" in this context. In our study, we apply logistic regression models to 9 changes which occurred between the 15th and 18th centuries in the Polish language. The attested course of the majority of these changes closely follows the expected values, which shows that language change might indeed resemble a nonlinear phase change scenario. We also extend Piotrowski's original approach by proposing polynomial logistic regression for those cases which can hardly be described by its standard version. Also, we propose to consider individual language change cases jointly, in order to inspect their possible collinearity or, more likely, their different dynamics as a function of time. Last but not least, we evaluate our results by testing the influence of the subcorpus size on the model's goodness-of-fit.
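To make the modeling idea concrete, here is a minimal sketch of fitting a logistic curve to the share of a new linguistic form over time. The counts are synthetic, generated from a known curve purely for illustration, and are not taken from the study; the helper `fit_logistic` and the time scaling are assumptions.

```python
import numpy as np

def fit_logistic(t, successes, totals, n_iter=25):
    """Binomial logistic regression of successes/totals on t (Newton's method)."""
    X = np.column_stack([np.ones(len(t)), t])
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-X @ beta))           # fitted share of the new form
        weights = totals * mu * (1 - mu)
        grad = X.T @ (successes - totals * mu)      # score vector
        hess = X.T @ (X * weights[:, None])         # Fisher information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Synthetic corpus slices every 50 years; time is in centuries around 1600.
years = np.arange(1400, 1801, 50)
t = (years - 1600) / 100.0
totals = np.full(len(t), 200.0)                     # tokens per slice (hypothetical)
true_share = 1 / (1 + np.exp(-2.0 * t))             # known generating curve
successes = np.round(totals * true_share)           # occurrences of the new form

intercept, slope = fit_logistic(t, successes, totals)
midpoint_year = 1600 + 100 * (-intercept / slope)   # year where the share is 50%
print(intercept, slope, midpoint_year)
```

The slope controls how fast the change spreads and the midpoint marks when the new form reaches half of all occurrences. A polynomial variant would add columns such as `t ** 2` to the design matrix.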
Multi-output machine learning models for kinetic data evaluation: A Fischer–Tropsch synthesis case study
Machine learning models such as Lasso regression are not sufficient for multi-output Fischer–Tropsch synthesis prediction. Artificial Neural Network regression is able to capture the complex non-linearity in Fischer–Tropsch synthesis kinetic data. The SHAP interpretation technique ranks the process variables by their contribution to the model prediction. Predicting the impact of input process variables on chemical processes is key to assessing the performance of the latter. Models explaining this impact through a mechanistic approach are rarely readily available, are complex in nature, and/or require long development time.
Theoretically Accurate Regularization Technique for Matrix Factorization based Recommender Systems
Regularization is a popular technique to solve the overfitting problem of machine learning algorithms. Most regularization techniques rely on parameter selection of the regularization coefficient. The plug-in method and the cross-validation approach are the two most common parameter selection approaches for regression methods such as Ridge Regression, Lasso Regression and Kernel Regression. Matrix factorization based recommendation systems also rely heavily on the regularization technique. In this paper, we prove that such an approach to selecting the regularization coefficient is invalid, and we provide a theoretically accurate method that outperforms the most widely used approach in both accuracy and fairness metrics.
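For readers unfamiliar with the cross-validation approach that the abstract refers to, the sketch below selects a ridge regularization coefficient by k-fold cross-validation. It illustrates the conventional procedure only, not the paper's proposed method; the helper names, the candidate grid and the synthetic data are assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_select_lambda(X, y, lambdas, n_folds=5, seed=0):
    """Return the candidate with the lowest mean validation error, plus all errors."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mean_errors = []
    for lam in lambdas:
        fold_errors = []
        for k in range(n_folds):
            val = folds[k]
            train = np.concatenate([folds[i] for i in range(n_folds) if i != k])
            w = ridge_fit(X[train], y[train], lam)
            fold_errors.append(np.mean((X[val] @ w - y[val]) ** 2))
        mean_errors.append(float(np.mean(fold_errors)))
    return lambdas[int(np.argmin(mean_errors))], mean_errors

# Synthetic regression problem (hypothetical sizes and noise level).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20))
w_true = rng.normal(size=20)
y = X @ w_true + rng.normal(size=120)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam, errors = cv_select_lambda(X, y, lambdas)
print(best_lam, errors)
```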
On Learning Mixture of Linear Regressions in the Non-Realizable Setting
Ghosh, Avishek, Mazumdar, Arya, Pal, Soumyabrata, Sen, Rajat
While mixture of linear regressions (MLR) is a well-studied topic, prior works usually do not analyze such models for prediction error. In fact, {\em prediction} and {\em loss} are not well-defined in the context of mixtures. In this paper, we first show that MLR can be used for prediction where, instead of predicting a label, the model predicts a list of values (also known as {\em list-decoding}). The list size is equal to the number of components in the mixture, and the loss function is defined to be the minimum among the losses incurred by all the component models. We show that with this definition, a solution of the empirical risk minimization (ERM) achieves small probability of prediction error. This calls for an algorithm to minimize the empirical risk for MLR, which is known to be computationally hard. Prior algorithmic works in MLR focus on the {\em realizable} setting, i.e., recovery of parameters when data is probabilistically generated by a mixed linear (noisy) model. In this paper we show that a version of the popular alternating minimization (AM) algorithm finds the best fit lines in a dataset even when a realizable model is not assumed, under some regularity conditions on the dataset and the initial points, and thereby provides a solution for the ERM. We further provide an algorithm that runs in polynomial time in the number of datapoints, and recovers a good approximation of the best fit lines. The two algorithms are experimentally compared.
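The min-loss definition and the alternating minimization idea can be sketched as follows. This is a generic textbook-style AM loop, not necessarily the exact variant analyzed in the paper; the function names, the initial points and the synthetic two-line dataset are assumptions.

```python
import numpy as np

def min_loss(X, y, W):
    """Empirical risk where each point is charged the smallest squared error
    over all component models (rows of W)."""
    return float(np.mean(((y[:, None] - X @ W.T) ** 2).min(axis=1)))

def alternating_minimization(X, y, init_weights, n_iter=20):
    W = np.array(init_weights, dtype=float)          # shape (components, features)
    for _ in range(n_iter):
        # Step 1: assign every point to the component that fits it best.
        labels = ((y[:, None] - X @ W.T) ** 2).argmin(axis=1)
        # Step 2: refit each component by least squares on its own points.
        for c in range(W.shape[0]):
            mask = labels == c
            if mask.sum() >= X.shape[1]:             # skip nearly empty components
                W[c] = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    return W

# Synthetic data from two lines, y = 2x + 1 and y = -2x - 1, plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
X = np.column_stack([x, np.ones_like(x)])
W_true = np.array([[2.0, 1.0], [-2.0, -1.0]])
component = rng.integers(0, 2, size=200)
y = np.einsum("ij,ij->i", X, W_true[component]) + 0.1 * rng.normal(size=200)

W_init = np.array([[1.5, 1.5], [-1.5, -0.5]])        # initial points near the truth
W_hat = alternating_minimization(X, y, W_init)
loss_before, loss_after = min_loss(X, y, W_init), min_loss(X, y, W_hat)
print(W_hat, loss_before, loss_after)
```

The initialization is placed close to the generating lines on purpose, in the spirit of the regularity conditions on the initial points mentioned in the abstract; a poor start can leave AM stuck in a bad assignment.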
Day 15 – 60 days of Data Science and Machine Learning
Hope you all had a great Halloween weekend [I dressed up as "Mother of Dragons" along with my cool "Game of Thrones" techie friends] ;) #winteriscoming. Let's get back and learn some more data science and machine learning. I hope you have all already grasped the Python essentials, Statistics and Maths from Day 1 to Day 8 (links shared below), Pandas part 1 and part 2 on Day 9 and Day 10, NumPy on Day 11, Data Preprocessing part 1 on Day 12, Data Preprocessing part 2 on Day 13, and Hands-on Regression part 1 on Day 14. In this post we will cover how to implement Regression part 2 as Day 15. Linear Regression is basically a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory (or independent) variables, as it just minimizes the least squares error: for one object, the target is $y = x^T w$, where $w$ is the vector of model weights.
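As a quick, hedged illustration of the least squares idea, the snippet below recovers the weights w from noisy synthetic data with NumPy. The data, the noise level and the variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Design matrix with a column of ones so the first weight acts as the intercept.
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
w_true = np.array([1.0, 2.0, -3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)   # targets: x^T w plus small noise

# Least squares: choose w that minimizes the sum of squared errors ||y - X w||^2.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_pred = X @ w_hat                             # prediction for each object is x^T w
mse = np.mean((y - y_pred) ** 2)
print(w_hat, mse)
```

`np.linalg.lstsq` is used instead of explicitly inverting `X.T @ X` because it is numerically more stable when features are correlated.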
Development and internal validation of a machine-learning-developed model for predicting 1-year mortality after fragility hip fracture - BMC Geriatrics
Fragility hip fracture increases morbidity and mortality in older adult patients, especially within the first year. Identification of patients at high risk of death facilitates modification of associated perioperative factors that can reduce mortality. Various machine learning algorithms have been developed and are widely used in healthcare research, particularly for mortality prediction. This study aimed to develop and internally validate 7 machine learning models to predict 1-year mortality after fragility hip fracture. This retrospective study included patients with fragility hip fractures from a single center (Siriraj Hospital, Bangkok, Thailand) from July 2016 to October 2018. A total of 492 patients were enrolled. They were randomly categorized into a training group (344 cases, 70%) or a testing group (148 cases, 30%). Various machine learning techniques were used: the Gradient Boosting Classifier (GB), Random Forests Classifier (RF), Artificial Neural Network Classifier (ANN), Logistic Regression Classifier (LR), Naive Bayes Classifier (NB), Support Vector Machine Classifier (SVM), and K-Nearest Neighbors Classifier (KNN). All models were internally validated by evaluating their performance and the area under a receiver operating characteristic curve (AUC). For the testing dataset, the accuracies were GB model = 0.93, RF model = 0.95, ANN model = 0.94, LR model = 0.91, NB model = 0.89, SVM model = 0.90, and KNN model = 0.90. All models achieved high AUCs that ranged between 0.81 and 0.99. The RF model also provided a negative predictive value of 0.96, a positive predictive value of 0.93, a specificity of 0.99, and a sensitivity of 0.68. Our machine learning approach facilitated the successful development of an accurate model to predict 1-year mortality after fragility hip fracture.
Several machine learning algorithms (e.g., Gradient Boosting and Random Forest) had the potential to provide high predictive performance based on the clinical parameters of each patient. The web application is available at www.hipprediction.com. External validation in a larger group of patients or in different hospital settings is warranted to evaluate the clinical utility of this tool. Thai Clinical Trials Registry (22 February 2021; reg. no. TCTR20210222003).
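The reported metrics (accuracy, sensitivity, specificity, positive and negative predictive value) all derive from the confusion matrix of a binary classifier. The sketch below shows the arithmetic on a tiny made-up label set; the numbers are not from the study and the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for labels coded as 1 (event) and 0 (no event).
    Assumes every denominator is non-zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # share of true events that are detected
        "specificity": tn / (tn + fp),   # share of non-events correctly cleared
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }

# Toy example: 3 events and 5 non-events (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

When events are rare, a model can combine high accuracy and specificity with a much lower sensitivity, which is the pattern visible in the RF figures quoted above.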
Know About Ensemble Methods in Machine Learning - Analytics Vidhya
This article was published as a part of the Data Science Blogathon. The bias is the difference between the model's prediction and the ground truth value, whereas the variance is the outcome of sensitivity to tiny perturbations in the training set. Excessive bias might cause an algorithm to miss relevant relationships between the intended outputs and the features (underfitting). High variance might cause an algorithm to model random noise in the training data (overfitting). The bias-variance tradeoff is a property of a model stating that lowering the bias in the estimated parameters increases the variance of the parameter estimates across samples.
Factorized Structured Regression for Large-Scale Varying Coefficient Models
Rügamer, David, Bender, Andreas, Wiegrebe, Simon, Racek, Daniel, Bischl, Bernd, Müller, Christian L., Stachl, Clemens
Recommender Systems (RS) pervade many aspects of our everyday digital life. Proposed to work at scale, state-of-the-art RS allow the modeling of thousands of interactions and facilitate highly individualized recommendations. Conceptually, many RS can be viewed as instances of statistical regression models that incorporate complex feature effects and potentially non-Gaussian outcomes. Such structured regression models, including time-aware varying coefficients models, are, however, limited in their applicability to categorical effects and inclusion of a large number of interactions. Here, we propose Factorized Structured Regression (FaStR) for scalable varying coefficient models. FaStR overcomes limitations of general regression models for large-scale data by combining structured additive regression and factorization approaches in a neural network-based model implementation. This fusion provides a scalable framework for the estimation of statistical models in previously infeasible data settings. Empirical results confirm that the estimation of varying coefficients of our approach is on par with state-of-the-art regression techniques, while scaling notably better and also being competitive with other time-aware RS in terms of prediction performance. We illustrate FaStR's performance and interpretability on a large-scale behavioral study with smartphone user data.
Research Papers based on Lasso Regression part 2 (Machine Learning)
Abstract: The application of the lasso is espoused in high-dimensional settings where only a small number of the regression coefficients are believed to be nonzero. Moreover, statistical properties of high-dimensional lasso estimators are often proved under the assumption that the correlation between the predictors is bounded. In this vein, coordinatewise methods, the most common means of computing the lasso solution, work well in the presence of low to moderate multicollinearity. Motivated by these limitations, we propose the novel "Deterministic Bayesian Lasso" algorithm for computing the lasso solution. This algorithm is developed by considering a limiting version of the Bayesian lasso.
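For context, the coordinatewise approach mentioned in the abstract can be sketched in a few lines. This is plain cyclic coordinate descent with soft-thresholding, not the proposed "Deterministic Bayesian Lasso"; the objective scaling, function names and synthetic data are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimize (1/(2n)) * ||y - X w||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)                  # residual y - X w (w starts at zero)
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * w[j]          # remove coordinate j's contribution
            rho = X[:, j] @ resid / n        # correlation with the partial residual
            w[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * w[j]          # add the updated contribution back
    return w

# Sparse synthetic problem: only the first two coefficients are nonzero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]
y = X @ w_true + 0.1 * rng.normal(size=100)

w_hat = lasso_cd(X, y, lam=0.1)
print(np.round(w_hat, 3))
```

Each update solves a one-dimensional problem exactly. When predictors are strongly correlated, successive updates largely undo one another and convergence slows, which is the kind of limitation the abstract refers to.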