Regression
Trees, forests, and impurity-based variable importance
Tree ensemble methods such as random forests [Breiman, 2001] are very popular for handling high-dimensional tabular data sets, notably because of their good predictive accuracy. However, when machine learning is used for decision-making problems, settling for the best predictive procedures may not be reasonable, since enlightened decisions require an in-depth comprehension of the algorithm's prediction process. Unfortunately, random forests are not intrinsically interpretable, since their prediction results from averaging several hundred decision trees. A classic approach to gaining knowledge about this so-called black-box algorithm is to compute variable importances, which are used to assess the predictive impact of each input variable. Variable importances are then used to rank or select variables and thus play a great role in data analysis. Nevertheless, there is no justification for using random forest variable importances in such a way: we do not even know what these quantities estimate. In this paper, we analyze one of the two well-known random forest variable importances, the Mean Decrease Impurity (MDI). We prove that if input variables are independent and in the absence of interactions, MDI provides a variance decomposition of the output, where the contribution of each variable is clearly identified. We also study models exhibiting dependence between input variables or interaction, for which the variable importance is intrinsically ill-defined. Our analysis shows that there may be some benefits to using a forest compared to a single tree.
A Smooth Introduction to Linear Regression and its Implementation in PyTorch (Part-I)
The term linear regression might seem a bit complex at first, but don't worry: it is a simple yet powerful concept. Let us go through an example to understand it better. I will throw some random data here and see what happens. Say that we have two variables, h and r, that refer to the time of day and the number of pages a person has read, respectively. Let's assume that the person starts reading at 9:00 AM and finishes at 2:00 PM.
Sample Complexity of Learning Mixture of Sparse Linear Regressions
Krishnamurthy, Akshay, Mazumdar, Arya, McGregor, Andrew, Pal, Soumyabrata
In the problem of learning mixtures of linear regressions, the goal is to learn a collection of signal vectors from a sequence of (possibly noisy) linear measurements, where each measurement is evaluated on an unknown signal drawn uniformly from this collection. This setting is quite expressive and has been studied both in terms of practical applications and for the sake of establishing theoretical guarantees. In this paper, we consider the case where the signal vectors are sparse; this generalizes the popular compressed sensing paradigm. We improve upon the state-of-the-art results as follows: In the noisy case, we resolve an open question of Yin et al. (IEEE Transactions on Information Theory, 2019) by showing how to handle collections of more than two vectors and present the first robust reconstruction algorithm, i.e., if the signals are not perfectly sparse, we still learn a good sparse approximation of the signals. In the noiseless case, as well as in the noisy case, we show how to circumvent the need for a restrictive assumption required in the previous work.
WTH are R-squared and Adjusted R-squared?
Today I am going to explain the concept of R-squared and adjusted R-squared from the Machine Learning perspective. I'll also show you how to find the R-squared value of your ML model. R-squared acts as an evaluation metric for regression models. To understand it better, let me introduce a regression problem. Suppose I'm building a model to predict how many articles I will write in a particular month given the amount of free time I have in that month.
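As a rough illustration of the two metrics, here is a minimal sketch using the standard definitions, R² = 1 - SS_res / SS_tot and adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1). The function names and the article counts below are hypothetical and are not taken from the original post.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variation
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R-squared for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical data: articles actually written per month vs. model predictions.
articles_written = [2, 4, 5, 7, 9]
articles_predicted = [2.5, 3.5, 5.5, 7.0, 8.5]

r2 = r_squared(articles_written, articles_predicted)
adj_r2 = adjusted_r_squared(r2, n_samples=len(articles_written), n_features=1)
print(f"R-squared: {r2:.4f}, adjusted R-squared: {adj_r2:.4f}")
```

With a single predictor (free time), the adjustment is small; it grows as more predictors are added without a matching gain in fit.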
Logistic Regression Example - Beyond Whyy
In this section, we will try to implement a very simple example of logistic regression. For this example, we will try to predict the result (pass/fail) of a student based on two test scores. Notice that the predicted result is a value between 0 and 1; it can be interpreted as the probability that the result is positive (y = 1). In our example, for test scores 60 and 75, there is an 87% probability that the student passes. An additional condition like the following can be used inside the predict function for binary output.
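A minimal sketch of such a predict function with a threshold condition might look like this. The parameter values in `theta` are hypothetical placeholders, not the fitted parameters from the original example, so the printed probability is not expected to match the 87% quoted above.

```python
import math


def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, scores):
    """Probability that y = 1; theta[0] is the intercept."""
    z = theta[0] + sum(t * x for t, x in zip(theta[1:], scores))
    return sigmoid(z)


def predict(theta, scores, threshold=0.5):
    """Binary output: 1 (pass) if the probability reaches the threshold, else 0 (fail)."""
    return 1 if predict_proba(theta, scores) >= threshold else 0


# Hypothetical parameters (intercept, weight for test 1, weight for test 2).
theta = [-25.0, 0.2, 0.2]

probability = predict_proba(theta, [60, 75])
print(f"P(pass) = {probability:.3f}, predicted class = {predict(theta, [60, 75])}")
```

The 0.5 threshold is a common default; a different cut-off can be passed when false passes and false fails have different costs.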
Logistic Regression-2 - Beyond Whyy
In the last section, we saw that the linear regression hypothesis function needed to be modified in order to be used for logistic regression problems. Now the obvious question is whether the same cost function can be used here as well, or whether some modification is required. The cost function worked for linear regression because the hypothesis was linear, and hence the cost function was convex with a single global minimum. In the case of logistic regression, the hypothesis function is no longer linear because of the sigmoid function, and using the same definition for the cost function would yield a function as shown below. Optimizing the parameters theta with such a cost function, with its many local minima, would be very difficult.
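The excerpt stops before naming a replacement cost function. The one commonly used for logistic regression is the log loss (cross-entropy), which is convex in theta. The following is a minimal sketch of it, assuming a tiny made-up data set and hypothetical parameter values; it is not necessarily the formulation used in the original post.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(theta, X, y):
    """Average cross-entropy cost; theta[0] is the intercept."""
    eps = 1e-12  # clip probabilities so log() never receives 0
    total = 0.0
    for features, label in zip(X, y):
        z = theta[0] + sum(t * x for t, x in zip(theta[1:], features))
        h = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(label * math.log(h) + (1 - label) * math.log(1.0 - h))
    return total / len(y)


# Hypothetical one-feature data set: small inputs are class 0, large inputs class 1.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [0, 0, 1, 1]

cost_at_zero = log_loss([0.0, 0.0], X, y)   # every prediction is 0.5
cost_better = log_loss([-5.0, 2.0], X, y)   # decision boundary at x = 2.5
print(f"cost at theta = 0: {cost_at_zero:.4f}, cost at better theta: {cost_better:.4f}")
```

With all parameters at zero, every prediction is 0.5 and the cost equals log 2; parameters that separate the classes give a lower cost.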
Logistic Regression-1 - Beyond Whyy
The second type of supervised learning algorithm we will look into is called logistic regression. First things first: it is not a regression algorithm, as its name suggests; rather, it is used for classification purposes. The name comes from the fact that classification is done by building on the linear regression model. As mentioned earlier, classification methods are used in problems where we want to label or categorize inputs. In this section we will be dealing with binary classification problems (those with 2 possible outcomes).
Review of Probability Distributions for Modeling Count Data
Count data take on non-negative integer values and are challenging to properly analyze using standard linear-Gaussian methods such as linear regression and principal components analysis. Generalized linear models enable direct modeling of counts in a regression context using distributions such as the Poisson and negative binomial. When counts contain only relative information, multinomial or Dirichlet-multinomial models can be more appropriate. We review some of the fundamental connections between multinomial and count models from probability theory, providing detailed proofs. These relationships are useful for methods development in applications such as topic modeling of text data and genomics.
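One well-known connection of this kind is that independent Poisson counts, conditioned on their total, follow a multinomial distribution whose probabilities are the normalized rates. Whether the paper presents the result in exactly this form is an assumption; the rates and counts below are arbitrary illustrative values. The sketch checks the identity numerically for one outcome.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def multinomial_pmf(counts, probs):
    """P(counts) for a multinomial with n = sum(counts) trials."""
    coefficient = math.factorial(sum(counts))
    for c in counts:
        coefficient //= math.factorial(c)  # exact integer division at each step
    value = float(coefficient)
    for c, q in zip(counts, probs):
        value *= q ** c
    return value


# Arbitrary illustrative rates and one observed count vector.
rates = [1.5, 2.0, 0.5]
counts = [2, 3, 1]
total_rate = sum(rates)
n = sum(counts)

# Joint probability of the independent Poisson counts.
joint = math.prod(poisson_pmf(c, lam) for c, lam in zip(counts, rates))
# The sum of independent Poissons is Poisson with the summed rate.
conditional = joint / poisson_pmf(n, total_rate)
multinomial = multinomial_pmf(counts, [lam / total_rate for lam in rates])

print(f"conditional Poisson: {conditional:.6f}, multinomial: {multinomial:.6f}")
```

The two printed values agree up to floating-point error, which is why multinomial models are a natural choice when only the relative sizes of the counts carry information.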
Forecasting US Equity Market Returns with Machine Learning
Shiller's CAPE ratio is a popular and useful metric for measuring whether stock prices are overvalued or undervalued relative to earnings. Recently, Vanguard analysts Haifeng Wang, Harshdeep Singh Ahluwalia, Roger A. Aliaga-Díaz, and Joseph H. Davis have written a very interesting paper on forecasting equity returns using Shiller's CAPE and machine learning: "The Best of Both Worlds: Forecasting US Equity Market Returns using a Hybrid Machine Learning – Time Series Approach". First, what is the Shiller CAPE ratio? If we do a simple regression of Shiller's CAPE ratio against future 10-year returns, we observe a very strong relationship. Here we see a historical chart of actual 10-year annualized stock returns vs. those predicted by Shiller's CAPE. Using the Shiller regression, the current CAPE of about 30 suggests near-zero real return over the next 10 years.