On Why Gradient Descent is Even Needed – Daniel Burkhardt Cerigo – Medium
Gradient descent is taught as a de facto part of machine learning, but when I was asked some questions about why we even use it, I realised I wasn't crystal clear on the answer, so I went and worked it out for myself. I was giving a presentation to a group of very talented young mathematicians at King's College London Mathematics School, and during the talk I showed a slide from Andrew Ng's classic Stanford Machine Learning MOOC. It shows how the cost function J (also called the error or loss) varies as we alter our model parameters θ1 and θ2, that is, as we "move" in parameter space, thus creating a surface. The slide is there to represent visually, and help explain, how gradient descent works. We start at the uppermost point (the black x-mark) and take a short step along the gradient of the surface at that point (strictly, in the direction opposite to the gradient, so that we go "down" and not "up"), with the goal of reaching a trough, or minimum, of the cost function, so that our model makes predictions that are close(r) to the actual labels of our training data. We had already had a Q&A after the talk, but afterwards a few students approached me with more detailed questions.
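To make the "short step downhill" idea concrete, here is a minimal sketch in Python. It is not taken from the course slide: the bowl-shaped cost function, its minimum at (1, -2), the starting point, the learning rate, and the step count are all assumptions chosen for illustration.

```python
def cost(theta1, theta2):
    # Hypothetical bowl-shaped cost surface J with its minimum at (1, -2).
    return (theta1 - 1.0) ** 2 + (theta2 + 2.0) ** 2


def gradient(theta1, theta2):
    # Partial derivatives of the cost above with respect to each parameter.
    return 2.0 * (theta1 - 1.0), 2.0 * (theta2 + 2.0)


def gradient_descent(theta1, theta2, learning_rate=0.1, steps=100):
    # Record the cost at every step so we can check that it goes down.
    history = [cost(theta1, theta2)]
    for _ in range(steps):
        g1, g2 = gradient(theta1, theta2)
        # Step in the direction opposite to the gradient, i.e. "down".
        theta1 -= learning_rate * g1
        theta2 -= learning_rate * g2
        history.append(cost(theta1, theta2))
    return theta1, theta2, history


# Start high up on the surface (an arbitrary, assumed starting point).
theta1, theta2, history = gradient_descent(4.0, 3.0)
print(theta1, theta2, history[0], history[-1])
```

Each update moves the parameters a small distance against the gradient, so the recorded cost shrinks from step to step and the parameters settle near the bottom of the bowl. On a less well-behaved surface, the same loop could stall in a local trough or overshoot if the learning rate were too large.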
Oct-29-2018, 18:49:02 GMT