Regression
Semi-Supervised Empirical Risk Minimization: When can unlabeled data improve prediction
We present a general methodology for using unlabeled data to design semi-supervised learning (SSL) variants of the Empirical Risk Minimization (ERM) learning process. Focusing on generalized linear regression, we provide a careful treatment of the effectiveness of SSL in improving prediction performance. The key ideas are carefully considering the null model as a competitor, and utilizing the unlabeled data to determine signal-noise combinations where the SSL outperforms both the ERM learning and the null model. In the special case of linear regression with Gaussian covariates, we show that the previously suggested semi-supervised estimator is in fact not capable of improving on both the supervised estimator and the null model simultaneously. However, the new estimator presented in this work can achieve an improvement of order $O(1/n)$ over both competitors simultaneously. On the other hand, we show that in other scenarios, such as non-Gaussian covariates, misspecified linear regression, or generalized linear regression with non-linear link functions, having unlabeled data can yield substantial improvement in prediction when our suggested SSL approach is applied. Moreover, it is possible to identify the usefulness of the SSL by using the dedicated formulas we establish throughout this work. This is shown empirically through extensive simulations.
Boosting House Price Predictions using Geo-Spatial Network Embedding
Das, Sarkar Snigdha Sarathi, Ali, Mohammed Eunus, Li, Yuan-Fang, Kang, Yong-Bin, Sellis, Timos
Real estate contributes significantly to all major economies around the world. In particular, house prices have a direct impact on stakeholders, ranging from house buyers to financing companies. Thus, a plethora of techniques have been developed for real estate price prediction. Most of the existing techniques rely on different house features to build a variety of prediction models to predict house prices. Perceiving the effect of spatial dependence on house prices, some later works focused on introducing spatial regression models for improving prediction performance. However, they fail to take into account the geo-spatial context of the neighborhood amenities such as how close a house is to a train station, or a highly-ranked school, or a shopping center. Such contextual information may play a vital role in users' interests in a house and thereby has a direct influence on its price. In this paper, we propose to leverage the concept of graph neural networks to capture the geo-spatial context of the neighborhood of a house. In particular, we present a novel method, the Geo-Spatial Network Embedding (GSNE), that learns the embeddings of houses and various types of Points of Interest (POIs) in the form of multipartite networks, where the houses and the POIs are represented as attributed nodes and the relationships between them as edges. Extensive experiments with a large number of regression techniques show that the embeddings produced by our proposed GSNE technique consistently and significantly improve the performance of the house price prediction task regardless of the downstream regression model.
Linear Regression Coefficients Are Probably Lying to You
Interpreting linear regression coefficients is common because it is so easy. Training a model can be done in a few lines of code, and the results yield statistics that can be stated matter-of-factly: "each additional point on the SAT increases your chances of admission by 0.002%". Whenever you train a linear regression (or logistic regression) model with this intent, be wary: you are treading in dangerous waters. What is linear regression even doing? It multiplies each of the inputs by a value and adds them up -- as an additional degree of freedom, an 'intercept' can be added.
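To make the "multiply each input by a value and add them up" description concrete, here is a minimal sketch in plain Python. The function name `linear_predict` and all numbers are hypothetical and chosen only for illustration; they do not come from the article.

```python
def linear_predict(inputs, coefficients, intercept=0.0):
    """Return intercept + sum of coefficient * input over all features."""
    return intercept + sum(w * x for w, x in zip(coefficients, inputs))


# Hypothetical two-feature model (values are illustrative only).
coefficients = [2.0, -1.0]
intercept = 0.5

base = linear_predict([3.0, 4.0], coefficients, intercept)
# Raise the first input by one unit while holding the second fixed.
bumped = linear_predict([4.0, 4.0], coefficients, intercept)

# The change in the prediction equals the first coefficient.
effect_of_first_input = bumped - base
```

The "one more unit of input changes the output by the coefficient" reading only holds while every other input is held fixed, which is one reason a matter-of-fact interpretation can mislead when inputs are correlated.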
Predicting Car Price: EDA, Regression, Hypothesis Testing
I am predicting the selling price of a car based on various features, including the car's present price. I will be using Multiple Linear Regression to build the model. Let's dive in to understand the variables and use the correlation matrix to make the process easier. Now let's check whether we have outliers in our data. Rather than removing the outlier values, we would like to take the log of them.
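As a rough illustration of why taking the log can be preferable to dropping outliers, the sketch below applies a natural-log transform to a small list of made-up prices. The values and variable names are hypothetical and are not taken from the post's dataset; the transform assumes all values are strictly positive.

```python
import math

# Hypothetical selling prices with one extreme value (illustrative only).
prices = [2.5, 3.0, 4.1, 5.2, 35.0]

# Natural log compresses large values far more than small ones.
log_prices = [math.log(p) for p in prices]

# Spread of the raw values versus the transformed values.
raw_range = max(prices) - min(prices)
log_range = max(log_prices) - min(log_prices)

# The transform is invertible, so predictions made on the log scale
# can be mapped back to the original price scale with exp.
recovered = [math.exp(v) for v in log_prices]
```

Every observation is kept, but the extreme value pulls much less weight on the log scale than it does on the raw scale.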
Common Loss functions in machine learning for a Regression model
Machine learning is a pioneering subset of Artificial Intelligence in which machines learn by themselves from the available dataset. To optimize any machine learning model, a suitable loss function must be selected. A loss function characterizes how well the model performs over the training dataset. Loss functions express the discrepancy between the predictions of the model being trained and the actual values of the problem instances. If the deviation between the predicted and actual results is large, the loss function takes a very high value.
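The excerpt does not name specific loss functions, so the following is a minimal sketch of two that are commonly used for regression, mean squared error (MSE) and mean absolute error (MAE), written in plain Python. The function names and sample values are assumptions made for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences; penalizes large errors heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences; less sensitive to large errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and two sets of predictions (illustrative only).
y_true = [3.0, 5.0, 7.0]
close_pred = [2.0, 5.0, 10.0]
far_pred = [10.0, 15.0, 20.0]

mse_close = mean_squared_error(y_true, close_pred)
mae_close = mean_absolute_error(y_true, close_pred)
mse_far = mean_squared_error(y_true, far_pred)
mae_far = mean_absolute_error(y_true, far_pred)
```

Predictions that deviate more from the targets produce a larger value under both functions, which matches the behaviour described in the paragraph above.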
Random Forest (RF) Kernel for Regression, Classification and Survival
Feng, Dai, Baumgartner, Richard
Breiman's random forest (RF) can be interpreted as an implicit kernel generator, where the ensuing proximity matrix represents the data-driven RF kernel. The kernel perspective on the RF has been used to develop a principled framework for theoretical investigation of its statistical properties. However, the practical utility of the links between kernels and the RF has not been widely explored and systematically evaluated. The focus of our work is the investigation of the interplay between kernel methods and the RF. We elucidate the performance and properties of the data-driven RF kernels used by regularized linear models in a comprehensive simulation study comprising continuous, binary and survival targets. We show that for continuous and survival targets, the RF kernels are competitive with RF in higher dimensional scenarios with a larger number of noisy features. For the binary target, the RF kernel and RF exhibit comparable performance. As the RF kernel asymptotically converges to the Laplace kernel, we included it in our evaluation. For most simulation setups, the RF and RF kernel outperformed the Laplace kernel. Nevertheless, in some cases the Laplace kernel was competitive, showing its potential value for applications. We also provide the results from real life data sets for regression, classification and survival to illustrate how these insights may be leveraged in practice. Finally, we discuss further extensions of the RF kernels in the context of interpretable prototype and landmarking classification, regression and survival. We outline a future line of research for kernels furnished by Bayesian counterparts of the RF.
Causal Inference in Possibly Nonlinear Factor Models
This paper develops a general causal inference method for treatment effects models under selection on unobservables. A large set of covariates that admits an unknown, possibly nonlinear factor structure is exploited to control for the latent confounders. The key building block is a local principal subspace approximation procedure that combines $K$-nearest neighbors matching and principal component analysis. Estimators of many causal parameters, including average treatment effects and counterfactual distributions, are constructed based on doubly-robust score functions. Large-sample properties of these estimators are established, which only require relatively mild conditions on the principal subspace approximation. The results are illustrated with an empirical application studying the effect of political connections on stock returns of financial firms, and a Monte Carlo experiment. The main technical and methodological results regarding the general local principal subspace approximation method may be of independent interest.
Heart Disease predictions using Logistic Regression – Sushrut Tendulkar
The main purpose of this post is to explore the different ways in which Logistic Regression can be applied to the dataset, and thereby to understand how the model actually works. The idea is not to solve the problem itself. This post doesn't focus on getting the best score using different models; instead, it assumes that there's only one model available for use. This is part of a series of posts to learn and share the details of Logistic Regression. If you're new to this, kindly refer to my earlier posts on the same topic. The data set has different kinds of features, such as demographic, behavioural (including current smoker status and cigarettes per day) and medical history, and our task is to predict whether the person has a 10-year risk of coronary heart disease.
How is Machine Learning Useful for Macroeconomic Forecasting?
Coulombe, Philippe Goulet, Leroux, Maxime, Stevanovic, Dalibor, Surprenant, Stéphane
We move beyond "Is Machine Learning Useful for Macroeconomic Forecasting?" by adding the "how". The current forecasting literature has focused on matching specific variables and horizons with a particularly successful algorithm. In contrast, we study the usefulness of the underlying features driving ML gains over standard macroeconometric methods. We distinguish four so-called features (nonlinearities, regularization, cross-validation and alternative loss function) and study their behavior in both the data-rich and data-poor environments. To do so, we design experiments that allow us to identify the "treatment" effects of interest. We conclude that (i) nonlinearity is the true game changer for macroeconomic prediction, (ii) the standard factor model remains the best regularization, (iii) K-fold cross-validation is the best practice and (iv) the $L_2$ is preferred to the $\bar \epsilon$-insensitive in-sample loss. The forecasting gains of nonlinear techniques are associated with high macroeconomic uncertainty, financial stress and housing bubble bursts. This suggests that Machine Learning is useful for macroeconomic forecasting by mostly capturing important nonlinearities that arise in the context of uncertainty and financial frictions.
Linear Regression Algorithm --Under The Hood Math For Non-Mathematicians
Step 1: We will use the Python package NumPy for working with a sample dataset and Matplotlib to plot various graphs for visualisation. Step 2: Let us consider a simple scenario where a single input (independent) variable controls the value of the outcome (dependent) variable. We declare two NumPy arrays to hold the values of the independent and dependent variables. Step 3: Let us quickly draw a scatter plot to understand the data points. Our goal is to formulate a linear equation that can predict the dependent variable value with minimum error for a given independent (input) variable.
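The post's own code and data are not included in this excerpt, so the following is only a sketch of the steps under stated assumptions. The sample values in `x` and `y` are hypothetical, the line is fitted with the standard closed-form least-squares formulas for a single input, and the Matplotlib scatter plot from Step 3 is omitted so the snippet depends on NumPy alone.

```python
import numpy as np

# Step 2: hypothetical sample data (not the values used in the post).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])  # dependent variable

# Closed-form least squares for one input:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Predictions of the fitted line and their mean squared error.
predictions = slope * x + intercept
mse = np.mean((y - predictions) ** 2)

# Cross-check against NumPy's built-in degree-1 polynomial fit,
# which returns the coefficients as [slope, intercept].
reference_slope, reference_intercept = np.polyfit(x, y, 1)
```

The fitted line `y = slope * x + intercept` is the linear equation with the smallest mean squared error on these points, and the `np.polyfit` call is there only as an independent check on the hand-written formulas.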