learning reusable option
Learning Reusable Options for Multi-Task Reinforcement Learning
The option-critic architecture [2] is a more direct approach that learns options and a policy over options simultaneously. The option policies and their termination functions are trained using policy gradient methods, while the policy over options may be trained using any technique. One issue that often arises within this framework is that the termination functions of the learned options tend to collapse to "always terminate". In a later publication, the authors built on this work to consider the case where there is a cost associated with switching options [6]. This method resulted in the agent learning to use a single option while it was appropriate and terminate when an option switch was needed, allowing it to discover improved policies for a particular task.
Learning Reusable Options for Multi-Task Reinforcement Learning
Garcia, Francisco M., Nota, Chris, Thomas, Philip S.
One of the main reasons why RL has worked so well in these applications is that we are able simulate millions of interactions with the environment in a relatively short period of time, allowing the agent to experience a large number of different situations in the environment and learn the consequences of its actions. In many real world applications, however, where the agent interacts with the physical world, it might not be easy to generate such a large number of interactions. The time and cost associated with training such systems could render RL an unfeasible approach for training in large scale. As a concrete example, consider training a large number of humanoid robots (agents) to move quickly, as in the Robocup competition [ Farchy et al., 2013 ] . Although the agents have similar dynamics, subtle variations mean that a single policy shared across all agents would not be an effective solution.