Approximate Temporal Difference Learning is a Gradient Descent for Reversible Policies
