Linear Regression with Python
Let's start with a simple problem. Suppose we have a small dataset of house prices for a specific area of a city. The dataset contains two fields, the size of a house and its price (SIZE, PRICE), and I would like to know the price of a house of a given size. The problem is that this size does not appear in my dataset, so what should I do? We already know from the title that the solution is linear regression, but to make the explanation easier, I've collected a small dataset of house prices; the table below shows a snippet of it.
Visualization helps a lot in identifying patterns in data, so to get a better view of our dataset, I'm going to plot it using the matplotlib Python library.
From the plot we can see that the price grows with the size, but the points don't form a perfect line that we could use to predict the price for a new size. So we need to find a linear function h(x) that passes close to all the points, though not necessarily through them. We call this function the hypothesis, h(x) = θ0 + θ1·x, where θ0 and θ1 are the unknown parameters we need to find.
In equation 2, m is the size of our dataset, Xi is the ith size, and Yi is the ith price in the dataset. We call J the error function (or the objective function), and it is what we need to minimize. There are other error functions or estimators in statistics that we could use, but in our case we'll use the mean squared error (MSE) estimator, because it makes the unknown parameters easier to find. With it, J is the average of the squared differences between the predicted prices h(Xi) and the actual prices Yi.
The estimator J takes two arguments, θ0 and θ1, so its graph is a surface in three dimensions. Figure 3 shows what the function looks like in a 3D graph. Our goal is to find its minimum value, which is the lowest point of the graph below. Imagine putting a ball inside the graph: it will roll down to the bottom of the shape. To find the lowest point of the shape, in other words to minimize the objective function, we'll use the gradient descent algorithm, which is very simple to understand.
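To make the hypothesis and the error function more concrete, here is a minimal Python sketch. The (size, price) values are invented for illustration and are not the dataset used in this article; the function names `hypothesis` and `cost` are also just illustrative choices. The error is scaled by 1/m, the plain mean; some texts use 1/(2m) instead, which does not change where the minimum is.

```python
# Hypothetical (size, price) pairs, invented for illustration only;
# they are not taken from the article's dataset.
sizes = [50.0, 80.0, 100.0, 120.0]
prices = [100.0, 160.0, 200.0, 240.0]


def hypothesis(theta0, theta1, x):
    """Linear hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Mean squared error J(theta0, theta1) over the m examples."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / m


# In this toy data the price is exactly 2 * size, so (0, 2) fits perfectly,
# while (0, 1) underestimates every price and gives a larger error.
print(cost(0.0, 2.0, sizes, prices))
print(cost(0.0, 1.0, sizes, prices))
```

Different choices of θ0 and θ1 give different values of J, and the best line is the one where J is smallest.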
To reach the bottom of the shape, we start from a randomly chosen point on the graph, which means setting θ0 and θ1 to random values. At that point we need to decide: do we go up or down?
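The idea of starting at a random point and repeatedly moving downhill can be sketched as below. This is only one possible way to write gradient descent, not necessarily the implementation this article builds: the toy data, the learning rate, the number of steps, and the function name `gradient_descent` are all assumptions made for illustration. The "up or down" decision is made by the gradient of the mean squared error: each step moves θ0 and θ1 a little in the direction opposite to the slope.

```python
import random

# Hypothetical toy data following y = 1 + 2 * x exactly (invented for
# illustration; not the article's house-price dataset).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]


def gradient_descent(xs, ys, learning_rate=0.05, steps=5000, seed=0):
    """Minimize the mean squared error of h(x) = theta0 + theta1 * x."""
    rng = random.Random(seed)
    # Start from a random point on the error surface.
    theta0 = rng.uniform(-1.0, 1.0)
    theta1 = rng.uniform(-1.0, 1.0)
    m = len(xs)
    for _ in range(steps):
        # Prediction errors h(x_i) - y_i at the current point.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J = (1/m) * sum(errors^2).
        grad0 = (2.0 / m) * sum(errors)
        grad1 = (2.0 / m) * sum(e * x for e, x in zip(errors, xs))
        # Move downhill: step against the gradient.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)  # should end up close to 1 and 2 for this toy data
```

The learning rate controls the size of each step: too large and the updates can overshoot the bottom and diverge, too small and reaching the bottom takes many more steps.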
Jan-27-2017, 19:05:55 GMT