Overfitting vs. Underfitting In Linear Regression
In the previous courses, we introduced linear and logistic regression to model a variable Y, discrete or continuous, from one or more variables Xi. In all the examples used to illustrate these techniques the modeling was relatively simple: Y was generally modeled by a line parameterized by the variables Xi. This modeling cannot be applied every time. An adequate model must be chosen with respect to our data in order to obtain the best fit.
In this course we study the effect of this modeling choice. We will see two cases: the first, when the model is too weak to capture our data, and the second, when the model is over-parameterized and over-fits our data.
Let's take a simple example and see what different modeling choices produce in the fit of the data. We use Python code to generate and visualize the data. The figure shows different fits for different modeling assumptions. The first panel shows the simplest choice, modeling our data by a straight line. Here the model is very weak and we do not end up with a good fit: this is underfitting, that is, the starting hypothesis is too weak for our data set.
In the other case, the model is over-parameterized, which over-adjusts to our data without following a correct trajectory. At the edges of the data we can see a significant oscillation, which can mislead us if we want to predict the value of a new point located near the edge. This is overfitting, that is, the starting hypothesis is over-parameterized for our data.
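The comparison between a straight-line fit and an over-parameterized fit can be sketched as follows. This is a minimal illustration, not the original code of the course: the noisy sine data, the random seed, the number of points, and the polynomial degrees 1 and 9 are all assumptions chosen for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Assumed toy data: a noisy sine curve (the original data set is not shown).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 15)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)


def fit_polynomial(degree):
    """Least-squares polynomial fit; returns the model and its training MSE."""
    model = Polynomial.fit(x, y, degree)
    mse = float(np.mean((y - model(x)) ** 2))
    return model, mse


line_model, line_mse = fit_polynomial(1)      # straight line (weak hypothesis)
wiggly_model, wiggly_mse = fit_polynomial(9)  # assumed over-parameterized degree

# Evaluate both fits on a dense grid so the curves can be plotted or inspected.
grid = np.linspace(-1.0, 1.0, 200)
line_curve = line_model(grid)
wiggly_curve = wiggly_model(grid)

print(f"degree 1 training MSE: {line_mse:.4f}")
print(f"degree 9 training MSE: {wiggly_mse:.4f}")
```

The degree 9 fit always reaches a training error at least as low as the straight line, because the larger family of polynomials contains the smaller one. That lower training error alone does not make it the better model. Plotting `wiggly_curve` against `grid` (for example with matplotlib) is one way to check how the curve behaves between the points and near the edges.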
To sum up, when modeling data we can face two problems. First, the hypothesis may be too weak to model our data (underfitting). Second, the hypothesis may be over-parameterized, so that it over-fits our data and cannot generalize to new examples (overfitting). A trade-off must therefore be made between the desired level of fit and the ability to generalize to new cases, in order to obtain the best model for the data.
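One common way to look at this trade-off is to hold out part of the data and compare the error on the points used for fitting with the error on the held-out points. The sketch below assumes the same kind of toy data as before (a noisy sine curve), an even/odd split, and polynomial degrees 1 to 10; none of these choices come from the course itself.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Assumed toy data: a noisy sine curve, split into training and held-out points.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 40)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)
x_train, y_train = x[::2], y[::2]    # even indices: used for fitting
x_test, y_test = x[1::2], y[1::2]    # odd indices: used only for evaluation


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return float(np.mean((ys - model(xs)) ** 2))


degrees = list(range(1, 11))
train_errors, test_errors = [], []
for degree in degrees:
    model = Polynomial.fit(x_train, y_train, degree)
    train_errors.append(mse(model, x_train, y_train))
    test_errors.append(mse(model, x_test, y_test))

# The degree with the lowest held-out error is one reasonable compromise.
best_degree = degrees[int(np.argmin(test_errors))]

for degree, tr, te in zip(degrees, train_errors, test_errors):
    print(f"degree {degree:2d}  train MSE {tr:.4f}  held-out MSE {te:.4f}")
print("degree with lowest held-out MSE:", best_degree)
```

The training error can only go down as the degree grows, so it cannot be used on its own to choose the model. The held-out error is what indicates whether the extra parameters help on new points or only fit the noise.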
Oct-24-2021, 10:30:08 GMT