Machine Learning Basics: Building a Regression model in R


The course "Machine Learning Basics: Building a Regression model in R" teaches you all the steps of creating a Linear Regression model, which is the most popular Machine Learning model, to solve business problems. Machine Learning is a field of computer science which gives the computer the ability to learn without being explicitly programmed. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention. What is the Linear regression technique of Machine learning? Linear Regression is a simple machine learning model for regression problems, i.e., when the target variable is a real value.



Other than the various cross-validation functions, the main functions are polyfit() and predict.polyFit(). One can fit either regression or classification models, with an option to perform PCA for dimension reduction on the predictors/features. Built in to the latest version of the regtools package. In the former case, getPE() reads in the dataset and does some preprocessing, producing a data frame pe. Forward stepwise regression is also available with FSR which also accepts polynomial degree and interaction as inputs.

Training Regression Models – Towards Data Science


You have been observing that since the past few years, happy employees are the key profit generators of your company and in all these years, you noted down the happiness index of all your employees and their productivity. Now you have tons of this employees' data just lying around in excel files and you just recently heard "Data is the new oil. The companies that will win are using math." You are wondering if you could also win by somehow mathifying this data that could predict the productivity of your new employees based on their happiness index. So that it would become easier for you to identify your least productive employees (and then supposedly fire them-just supposedly).

Accounting for Significance and Multicollinearity in Building Linear Regression Models Machine Learning

We derive explicit Mixed Integer Optimization (MIO) constraints, as opposed to iteratively imposing them in a cutting plane framework, that impose significance and avoid multicollinearity for building linear regression models. In this way we extend and improve the research program initiated in Bertsimas and King (2016) that imposes sparsity, robustness, pairwise collinearity and group sparsity explicitly and significance and avoiding multicollinearity iteratively. We present a variety of computational results on real and synthetic datasets that suggest that the proposed MIO has a significant computational edge compared to Bertsimas and King (2016) in accuracy, false detection rate and computational time in accounting for significance and multicollinearity as well as providing a holistic framework to produce regression models with desirable properties a priori.

New Risk Bounds for 2D Total Variation Denoising Machine Learning

2D Total Variation Denoising (TVD) is a widely used technique for image denoising. It is also an important non parametric regression method for estimating functions with heterogenous smoothness. Recent results have shown the TVD estimator to be nearly minimax rate optimal for the class of functions with bounded variation. In this paper, we complement these worst case guarantees by investigating the adaptivity of the TVD estimator to functions which are piecewise constant on axis aligned rectangles. We rigorously show that, when the truth is piecewise constant, the ideally tuned TVD estimator performs better than in the worst case. We also study the issue of choosing the tuning parameter. In particular, we propose a fully data driven version of the TVD estimator which enjoys similar worst case risk guarantees as the ideally tuned TVD estimator.

Fair Regression for Health Care Spending Machine Learning

The distribution of health care payments to insurance plans has substantial consequences for social policy. Risk adjustment formulas predict spending in health insurance markets in order to provide fair benefits and health care coverage for all enrollees, regardless of their health status. Unfortunately, current risk adjustment formulas are known to undercompensate payments to health insurers for specific groups of enrollees (by underpredicting their spending). Much of the existing algorithmic fairness literature for group fairness to date has focused on classifiers and binary outcomes. To improve risk adjustment formulas for undercompensated groups, we expand on concepts from the statistics, computer science, and health economics literature to develop new fair regression methods for continuous outcomes by building fairness considerations directly into the objective function. We additionally propose a novel measure of fairness while asserting that a suite of metrics is necessary in order to evaluate risk adjustment formulas more fully. Our data application using the IBM MarketScan Research Databases and simulation studies demonstrate that these new fair regression methods may lead to massive improvements in group fairness with only small reductions in overall fit.

ggeffects 0.8.0 now on CRAN: marginal effects for regression models #rstats


I'm happy to announce that version 0.8.0 of my ggeffects-package is on CRAN now. The update has fixed some bugs from the previous version and comes along with many new features or improvements. One major part that was addressed in the latest version are fixed and improvements for mixed models, especially zero-inflated mixed models (fitted with the glmmTMB-package). In this post, I want to demonstrate the different options to calculate and visualize marginal effects from mixed models. Basically, the type of predictions, i.e. whether to account for the uncertainty of random effects or not, can be set with the type-argument.

Visualizing and assessing discrimination in the logistic regression model. - PubMed - NCBI


Logistic regression models are widely used in medicine for predicting patient outcome (prognosis) and constructing diagnostic tests (diagnosis). Multivariable logistic models yield an (approximately) continuous risk score, a transformation of which gives the estimated event probability for an individual. A key aspect of model performance is discrimination, that is, the model's ability to distinguish between patients who have (or will have) an event of interest and those who do not (or will not). Graphical aids are important in understanding a logistic model. The receiver-operating characteristic (ROC) curve is familiar, but not necessarily easy to interpret.

Comparing Multilayer Perceptron and Multiple Regression Models for Predicting Energy Use in the Balkans Machine Learning

Global demographic and economic changes have a critical impact on the total energy consumption, which is why demographic and economic parameters have to be taken into account when making predictions about the energy consumption. This research is based on the application of a multiple linear regression model and a neural network model, in particular multilayer perceptron, for predicting the energy consumption. Data from five Balkan countries has been considered in the analysis for the period 1995-2014. Gross domestic product, total number of population, and CO2 emission were taken as predictor variables, while the energy consumption was used as the dependent variable. The analyses showed that CO2 emissions have the highest impact on the energy consumption, followed by the gross domestic product, while the population number has the lowest impact. The results from both analyses are then used for making predictions on the same data, after which the obtained values were compared with the real values. It was observed that the multilayer perceptron model predicts better the energy consumption than the regression model.

Nuclear Norm Regularized Estimation of Panel Regression Models Machine Learning

In this paper we investigate panel regression models with interactive fixed effects. We propose two new estimation methods that are based on minimizing convex objective functions. The first method minimizes the sum of squared residuals with a nuclear (trace) norm regularization. The second method minimizes the nuclear norm of the residuals. We establish the consistency of the two resulting estimators. Those estimators have a very important computational advantage compared to the existing least squares (LS) estimator, in that they are defined as minimizers of a convex objective function. In addition, the nuclear norm penalization helps to resolve a potential identification problem for interactive fixed effect models, in particular when the regressors are low-rank and the number of the factors is unknown. We also show how to construct estimators that are asymptotically equivalent to the least squares (LS) estimator in Bai (2009) and Moon and Weidner (2017) by using our nuclear norm regularized or minimized estimators as initial values for a finite number of LS minimizing iteration steps. This iteration avoids any non-convex minimization, while the original LS estimation problem is generally non-convex, and can have multiple local minima.