Regression
2022 Machine Learning A to Z : 5 Machine Learning Projects
The course covers: evaluation metrics to analyze the performance of models; industry relevance of linear and logistic regression; mathematics behind KNN, SVM and Naive Bayes algorithms; implementation of KNN, SVM and Naive Bayes using sklearn; attribute selection methods (Gini index and entropy); mathematics behind decision trees and random forest; boosting algorithms (AdaBoost, Gradient Boosting and XGBoost); different algorithms for clustering; different methods to deal with imbalanced data; correlation filtering; content-based and collaborative filtering; singular value decomposition; different algorithms used for time series forecasting; hands-on real-world examples. To get the most out of this course, you should be familiar with linear algebra, calculus, statistics, probability and the Python programming language. This course is a perfect fit for you. This course will take you step by step into the world of Machine Learning.
Differentially Private Regression with Unbounded Covariates
Milionis, Jason, Kalavasis, Alkis, Fotakis, Dimitris, Ioannidis, Stratis
Ever since the introduction of Differential Privacy (DP) by Dwork et al. (2006), differentially private variants of statistical estimation procedures have been a research topic of intense interest. The work on learning linear models alone is vast (see Cai et al. (2020); Wang (2018) for two recent reviews). Empirical Risk Minimization is also the impetus for the development of a broad array of new methods for DP-mechanism design, including output perturbation (Iyengar et al., 2019; Zhang et al., 2017; Jain and Thakurta, 2014), objective perturbation (Chaudhuri et al., 2011; Kifer et al., 2012), and gradient perturbation (Bassily et al., 2014; Abadi et al., 2016), to name a few. Nevertheless, despite the intense interest in this topic, all of the existing work on regression provides differential-privacy guarantees assuming bounded covariates. Intuitively, this can be explained by inspecting even the simple least squares estimator used in linear regression. It is easy to see that the estimator's sensitivity, i.e., its variability under changes to a single sample, is determined by the design matrix (i.e., the matrix of samples). As sensitivity has a direct effect on differential privacy guarantees, bounding the design matrix's eigenvalues is the prevalent approach for bounding the sensitivity. For this reason, bounded covariates are a ubiquitous assumption in the DP literature on both linear regression and learning generalized linear models.
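The sensitivity argument above can be illustrated numerically. The following is a minimal sketch, not taken from the paper: the helper names, the synthetic data, and the choice of replacing one sample with the point (scale, -scale) are all assumptions made for illustration. It shows that the change in the least squares estimate caused by a single replaced sample grows with the magnitude of that sample's covariate.

```python
import numpy as np

def ols(X, y):
    # Ordinary least squares estimate (no intercept), via lstsq.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def change_from_one_sample(scale, n=200, seed=0):
    # Hypothetical experiment: fit OLS, replace ONE sample by (scale, -scale),
    # refit, and report how far the slope estimate moved.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 1))
    y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)
    beta = ols(X, y)

    X_mod, y_mod = X.copy(), y.copy()
    X_mod[0, 0] = scale    # covariate of the replaced sample
    y_mod[0] = -scale      # response pulling the slope towards -1
    beta_mod = ols(X_mod, y_mod)
    return float(abs(beta_mod[0] - beta[0]))

small = change_from_one_sample(1.0)      # replaced covariate of ordinary size
large = change_from_one_sample(1000.0)   # replaced covariate far outside the bulk
print(f"change with |x| = 1:    {small:.4f}")
print(f"change with |x| = 1000: {large:.4f}")
```

With an unrestricted covariate, one sample can dominate the fit, which is why noise calibrated to sensitivity is hard to bound without some assumption on the covariates.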
Denoising and change point localisation in piecewise-constant high-dimensional regression coefficients
Wang, Fan, Padilla, Oscar Hernan Madrid, Yu, Yi, Rinaldo, Alessandro
We study the theoretical properties of the fused lasso procedure originally proposed by Tibshirani et al. (2005) in the context of a linear regression model in which the regression coefficients are totally ordered and assumed to be sparse and piecewise constant. Despite its popularity, to the best of our knowledge, estimation error bounds in high-dimensional settings have only been obtained for the simple case in which the design matrix is the identity matrix. We formulate a novel restricted isometry condition on the design matrix that is tailored to the fused lasso estimator and derive estimation bounds for both the constrained version of the fused lasso assuming dense coefficients and for its penalised version. We observe that the estimation error can be dominated by either the lasso or the fused lasso rate, depending on whether the number of non-zero coefficients is larger than the number of piecewise-constant segments. Finally, we devise a post-processing procedure to recover the piecewise-constant pattern of the coefficients. Extensive numerical experiments support our theoretical findings.
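For readers unfamiliar with the penalised fused lasso, the sketch below evaluates its usual textbook objective: a squared-error fit term, an l1 penalty on the coefficients, and an l1 penalty on differences between adjacent coefficients. The function name, the 1/2 scaling of the fit term, and the toy vectors are assumptions for illustration; the paper's exact normalisation may differ.

```python
import numpy as np

def fused_lasso_objective(X, y, beta, lam_sparse, lam_fuse):
    # 0.5 * ||y - X beta||^2 + lam_sparse * ||beta||_1 + lam_fuse * sum_j |beta_j - beta_{j-1}|
    residual = y - X @ beta
    fit = 0.5 * np.sum(residual ** 2)
    sparsity = lam_sparse * np.sum(np.abs(beta))
    fusion = lam_fuse * np.sum(np.abs(np.diff(beta)))
    return float(fit + sparsity + fusion)

# Two coefficient vectors with the same l1 norm (9) but different total variation.
beta_piecewise = np.array([0.0, 0.0, 3.0, 3.0, 3.0, 0.0, 0.0])
beta_wiggly = np.array([0.0, 3.0, 0.0, 3.0, 0.0, 3.0, 0.0])

tv_piecewise = float(np.sum(np.abs(np.diff(beta_piecewise))))
tv_wiggly = float(np.sum(np.abs(np.diff(beta_wiggly))))

# Identity design and noiseless responses, so only the penalties contribute.
X = np.eye(7)
obj_piecewise = fused_lasso_objective(X, beta_piecewise, beta_piecewise, 1.0, 1.0)
obj_wiggly = fused_lasso_objective(X, beta_wiggly, beta_wiggly, 1.0, 1.0)
print(tv_piecewise, tv_wiggly, obj_piecewise, obj_wiggly)
```

The fusion term is what separates the two vectors: the lasso penalty alone cannot distinguish them, while the fused penalty favours the piecewise-constant one.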
Customer Price Sensitivities in Competitive Automobile Insurance Markets
Insurers are increasingly adopting more demand-based strategies to incorporate the indirect effect of premium changes on their policyholders' willingness to stay. However, since in practice both insurers' renewal premia and customers' responses to these premia typically depend on the customer's level of risk, it remains challenging in these strategies to determine how to properly control for this confounding. We therefore consider a causal inference approach in this paper to account for customers' price sensitivity and to deduce optimal, multi-period profit maximizing premium renewal offers. More specifically, we extend the discrete treatment framework of Guelman and Guillén (2014) with Extreme Gradient Boosting, or XGBoost, and with multiple imputation to better account for the uncertainty in the counterfactual responses. We additionally introduce the continuous treatment framework with XGBoost to the insurance literature to allow identification of the exact optimal renewal offers and account for any competition in the market by including competitor offers. The application of the two treatment frameworks to a Dutch automobile insurance portfolio suggests that a policy's competitiveness in the market is crucial for a customer's price sensitivity and that XGBoost is more appropriate to describe this than the traditional logistic regression. Moreover, an efficient frontier of both frameworks indicates that substantially more profit can be gained on the portfolio than realized, even with less churn, and in particular if we allow for continuous rate changes. A multi-period renewal optimization confirms these findings and demonstrates that the competitiveness enables temporal feedback of previous rate changes on future demand.
A new LDA formulation with covariates
Shimizu, Gilson, Izbicki, Rafael, Valle, Denis
The Latent Dirichlet Allocation (LDA) model is a popular method for creating mixed-membership clusters. Despite having been originally developed for text analysis, LDA has been used for a wide range of other applications. We propose a new formulation for the LDA model which incorporates covariates. In this model, a negative binomial regression is embedded within LDA, enabling straightforward interpretation of the regression coefficients and the analysis of the quantity of cluster-specific elements in each sampling unit (instead of the analysis being focused on modeling the proportion of each cluster, as in Structural Topic Models). We use slice sampling within a Gibbs sampling algorithm to estimate model parameters. We rely on simulations to show that our algorithm successfully retrieves the true parameter values and can make predictions for the abundance matrix using the information given by the covariates. The model is illustrated using real data sets from three different areas: text-mining of Coronavirus articles, analysis of grocery shopping baskets, and ecology of tree species on Barro Colorado Island (Panama). This model allows the identification of mixed-membership clusters in discrete data and provides inference on the relationship between covariates and the abundance of these clusters.
Regulate Your Regression Model With Ridge, LASSO and ElasticNet
Linear models have a wide appeal. Even with a basic understanding of Excel, it is possible to create a model that explains patterns in data. After attaching weights (coefficients) to explanatory variables (features), it is easy to assess the importance of individual variables when explaining the data. It is not surprising that linear models have been around for many decades, and are widely used throughout many domains, ranging from psychology to business administration and from machine learning to statistics. Despite the superficial simplicity of linear models, many things can go wrong with them.
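One of the things that can go wrong is unstable coefficients when features are nearly collinear, which is a standard motivation for ridge regularization. The sketch below is an illustration under assumptions, not code from the article: the synthetic data, the helper name ridge_fit, the penalty value, and the omission of an intercept are all choices made here. It uses the closed-form ridge solution (X'X + lam I)^{-1} X'y.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge estimate (no intercept): solve (X^T X + lam * I) beta = X^T y.
    n_features = X.shape[1]
    gram = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)      # nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=100)   # true coefficients are (1, 1)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_ridge = ridge_fit(X, y, lam=1.0)

print("OLS:  ", beta_ols)    # individual weights can drift far from (1, 1)
print("ridge:", beta_ridge)  # shrunk towards a stable, nearly equal split
```

LASSO and ElasticNet replace or combine the squared penalty with an l1 penalty; they have no closed form and are usually fitted with an iterative solver.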
Develop and Operationalize ML models using plain SQL on Google BigQuery
Not too long ago, data deficiency was a major impediment to making informed decisions, understanding customer behavior, and making predictions and forecasts. In the modern digital age, where data continuously streams in all shapes and sizes and from all directions, enterprises are constantly challenged with sifting through petabytes of data to infer key indicators. Making sense of "the right data at the right time" yields a huge competitive edge. Blending real-time streams, batch processing, external data sources and machine learning, Google BigQuery transcends traditional data warehouse solutions with the ability to offer business insights into data across 3 dimensions: historical, real-time and predictive. BigQuery democratizes machine learning by letting users develop and operationalize ML models with just SQL skills.
Machine Learning 103: Loss Functions
In two previous articles I covered two of the most basic models used in machine learning -- linear regression and logistic regression. In both cases, we were interested in searching for the set of model parameters m that result in the best model predictions d' of the observed targets d, and in both cases this was done by minimizing some loss function L(m), which measures the error between d' and d. A good proportion of machine learning -- from simple linear regression to deep learning models -- essentially involves the minimization of some sort of loss function, and yet many data science or machine learning books/tutorials/materials tend to place more emphasis on the model itself than on the loss function! In this article, we will pick up where we left off in the previous two articles and focus on loss functions before exploring more advanced models in future articles! Now, just as "best" is a very subjective word, so are loss functions!
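To make the point that the "best" parameters depend on the loss concrete, here is a minimal sketch. The model, data, and grid search are assumptions chosen for illustration and are not from the article: the model is a single constant m, so the prediction d' equals m for every target, and two common losses are minimized over a grid of candidate values.

```python
import numpy as np

def mse_loss(d_pred, d):
    # Mean squared error between predictions d' and targets d.
    return float(np.mean((d_pred - d) ** 2))

def mae_loss(d_pred, d):
    # Mean absolute error between predictions d' and targets d.
    return float(np.mean(np.abs(d_pred - d)))

# Observed targets with one outlier.
d = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Constant model: every prediction equals the single parameter m.
candidates = np.linspace(0.0, 100.0, 10001)  # grid with step 0.01
best_mse = float(candidates[np.argmin([mse_loss(m, d) for m in candidates])])
best_mae = float(candidates[np.argmin([mae_loss(m, d) for m in candidates])])

print("minimizer of squared error: ", best_mse)  # close to the mean of d
print("minimizer of absolute error:", best_mae)  # close to the median of d
```

The same data yield two different "best" parameters: squared error is pulled towards the outlier, while absolute error stays with the bulk of the targets.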
Low-rank features based double transformation matrices learning for image classification
Cai, Yu-Hong, Wu, Xiao-Jun, Chen, Zhe
Linear regression is a supervised method that has been widely used in classification tasks. In order to apply linear regression to classification tasks, a technique for relaxing regression targets was proposed. However, methods based on this technique ignore the pressure on a single transformation matrix due to the complex information contained in the data. A single transformation matrix in this case is too strict to provide a flexible projection, so it is necessary to relax the transformation matrix as well. This paper proposes a double transformation matrices learning method based on latent low-rank feature extraction. The core idea is to use double transformation matrices for relaxation and to jointly project the learned principal and salient features from two directions into the label space, which shares the pressure otherwise placed on a single transformation matrix. Firstly, the low-rank features are learned by the latent low-rank representation (LatLRR) method, which processes the original data from two directions. In this process, sparse noise is also separated, which alleviates its interference on projection learning to some extent. Then, two transformation matrices are introduced to process the two features separately, and the information useful for the classification is extracted. Finally, the two transformation matrices can be easily obtained by alternating optimization methods. Through such processing, even when a large amount of redundant information is contained in samples, our method can still obtain projection results that are easy to classify. Experiments on multiple data sets demonstrate the effectiveness of our approach for classification, especially for complex scenarios.
Modeling High-Dimensional Data with Unknown Cut Points: A Fusion Penalized Logistic Threshold Regression
Lin, Yinan, Zhou, Wen, Geng, Zhi, Xiao, Gexin, Yin, Jianxin
In traditional logistic regression models, the link function is often assumed to be linear and continuous in predictors. Here, we consider a threshold model in which all continuous features are discretized into ordinal levels, which further determine the binary responses. Both the threshold points and regression coefficients are unknown and to be estimated. For high dimensional data, we propose a fusion penalized logistic threshold regression (FILTER) model, where a fused lasso penalty is employed to control the total variation and shrink the coefficients to zero as a method of variable selection. Under mild conditions on the estimate of unknown threshold points, we establish the non-asymptotic error bound for coefficient estimation and the model selection consistency. With a careful characterization of the error propagation, we have also shown that tree-based methods, such as CART, fulfill the threshold estimation conditions. We find the FILTER model is well suited to the problem of early detection and prediction for chronic diseases like diabetes, using physical examination data. The finite sample behavior of our proposed method is also explored and compared through extensive Monte Carlo studies, which support our theoretical discoveries.
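The discretization step described above can be sketched as follows. This is an illustration only: the cut points are hypothetical fixed values, whereas in the paper they are unknown and estimated, and the indicator encoding is one common way to feed ordinal levels to a logistic model rather than the authors' stated construction.

```python
import numpy as np

def discretize(feature, cut_points):
    # Map each continuous value to an ordinal level: level k means the value
    # lies in [cut_points[k-1], cut_points[k]), with open-ended first and last bins.
    return np.digitize(feature, cut_points)

def level_indicators(levels, n_levels):
    # One indicator column per ordinal level.
    return np.eye(n_levels)[levels]

feature = np.array([0.5, 1.5, 2.5, 3.5])
cut_points = np.array([1.0, 3.0])    # hypothetical thresholds, giving 3 levels

levels = discretize(feature, cut_points)
design = level_indicators(levels, n_levels=len(cut_points) + 1)
print(levels)   # ordinal level of each observation
print(design)   # indicator design matrix, one row per observation
```

A fused penalty on the coefficients of adjacent levels would then encourage neighbouring levels to share a coefficient, which effectively merges levels whose responses do not differ.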