Asynchronous n-step Q-learning

Q-learning is the best-known Temporal Difference algorithm. The original Q-learning algorithm tries to determine the state-action value function that minimizes the squared temporal-difference error, that is, the squared gap between the current estimate of a state-action value and the target built from the observed reward plus the discounted value of the best action in the next state. We will use an optimizer (the simplest one, Gradient Descent) to compute the values of the state-action function. First of all, we need to compute the gradient of the loss function with respect to the parameters of the function. Gradient descent then moves toward a minimum by repeatedly subtracting that gradient, scaled by a learning rate, from the parameters.
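To make the update concrete, here is a minimal sketch of one gradient descent step on the squared temporal-difference error. It rests on several assumptions that are not in the text above: the state-action function is linear in a feature vector, the target is held constant while differentiating (often called a semi-gradient update), and the names `theta`, `features`, `gamma`, and `lr` are purely illustrative.

```python
def q_value(theta, features, action):
    """Linear estimate Q(s, a; theta) = theta[action] . features (assumed form)."""
    return sum(w * x for w, x in zip(theta[action], features))


def q_learning_step(theta, features, action, reward, next_features,
                    gamma=0.9, lr=0.1, terminal=False):
    """Apply one gradient descent step on the squared TD error, in place.

    Returns the loss measured before the parameters were updated.
    """
    # Bootstrapped target; treated as a constant when differentiating.
    if terminal:
        target = reward
    else:
        target = reward + gamma * max(
            q_value(theta, next_features, a) for a in range(len(theta)))

    td_error = target - q_value(theta, features, action)

    # The derivative of td_error**2 w.r.t. theta[action] is
    # -2 * td_error * features (only the taken action's weights change).
    gradient = [-2.0 * td_error * x for x in features]

    # Gradient descent: subtract the scaled gradient from the parameters.
    theta[action] = [w - lr * g for w, g in zip(theta[action], gradient)]
    return td_error ** 2


# Toy usage: two actions, two features, one repeated terminal transition
# with reward 1. The estimate for action 0 should approach 1.
theta = [[0.0, 0.0], [0.0, 0.0]]
losses = [
    q_learning_step(theta, [1.0, 0.0], 0, 1.0, [0.0, 1.0], terminal=True)
    for _ in range(50)
]
```

In this toy run the loss shrinks at every step and only the weights of the action that was taken are modified. A neural network would replace the linear `q_value`, but the subtraction of the scaled gradient stays the same.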