[D] Why are non-linear approximators such as neural networks unstable for reinforcement learning?

#artificialintelligence 

As you know, in supervised learning it is important for the data to be i.i.d. In RL, the training data is sampled from the part of the state space that the agent chooses to explore, which tends to be highly correlated with the agent's current preferences and covers only a small subset of the total state space. Q-learning, when acting greedily, selects the action with the highest expected reward. So if a1 has an expected reward of 0.49 and a2 has an expected reward of 0.51, a small parameter change can cause the agent to swap from picking a2 100% of the time to picking a1 100% of the time, causing a significant shift in the distribution of data being trained on. At a higher conceptual level, you can think of RL as supervised learning where, instead of having clearly defined labels, you 'guess' what the label is using an often noisy reward signal, and the quality of your guess depends on how accurate your policy is.
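To make the a1/a2 flip concrete, here is a rough sketch in plain Python. It assumes a purely greedy policy with no exploration, and the `0.03` nudge is a made-up stand-in for a small parameter update, not something from a real training run. The helper names `greedy_action` and `action_frequencies` are just for illustration.

```python
def greedy_action(q_values):
    """Return the index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def action_frequencies(q_values, n_steps=1000):
    """Fraction of steps on which a purely greedy policy picks each action."""
    counts = [0] * len(q_values)
    for _ in range(n_steps):
        counts[greedy_action(q_values)] += 1
    return [c / n_steps for c in counts]


# Estimated values for a1 and a2, as in the example above.
q_before = [0.49, 0.51]

# Hypothetical small update that nudges the estimate for a1 upward.
q_after = [0.49 + 0.03, 0.51]

freq_before = action_frequencies(q_before)
freq_after = action_frequencies(q_after)

print(freq_before)  # [0.0, 1.0] -> a2 is picked every time
print(freq_after)   # [1.0, 0.0] -> a1 is picked every time
```

A change of 0.03 in one estimate moves the action distribution from all-a2 to all-a1, so the states the agent visits next (and therefore trains on) can look completely different from the ones it saw a moment ago.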
