Outcome-directed Reinforcement Learning by Uncertainty & Temporal Distance-Aware Curriculum Goal Generation
Cho, Daesol, Lee, Seungjae, Kim, H. Jin
arXiv.org Artificial Intelligence
While reinforcement learning (RL) shows promising results in the automated learning of behavioral skills, it still struggles to solve challenging uninformed search problems where the desired behavior and rewards are only sparsely observed. Some techniques tackle this problem by utilizing shaped rewards (Hartikainen et al., 2019) or by combining RL with representation learning for efficient exploration (Ghosh et al., 2018). However, these approaches not only become prohibitively time-consuming in terms of the human effort required, but also demand significant domain knowledge for shaping the reward or designing a task-specific representation learning objective. What if we could design an algorithm that automatically progresses toward the desired behavior without any domain knowledge or human effort, while distilling its experience into a general-purpose policy? An effective scheme for such an algorithm is to learn on a tailored sequence of curriculum goals, allowing the agent to autonomously practice intermediate tasks. A fundamental challenge, however, is that proposing curriculum goals to the agent is intimately connected to efficient outcome-directed exploration, and vice versa: if the curriculum generation fails to recognize the frontier of the explored and feasible region, efficient exploration toward the desired outcome states cannot be performed.
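As a rough illustration of the frontier idea above, one common heuristic is to propose goals from the least-visited explored states. This is a minimal sketch, not the paper's method; the visitation-count criterion and all names here are illustrative assumptions.

```python
import numpy as np

def select_frontier_goals(visited_states, visit_counts, n_goals=2):
    """Pick curriculum goals from the least-visited explored states.

    Assumption for this sketch: rarely reached states approximate the
    frontier of the explored region, so proposing them as goals pushes
    exploration outward toward the desired outcome.
    """
    order = np.argsort(visit_counts)  # rarest states first
    return visited_states[order[:n_goals]]

# Toy 1-D example: states further from the start were visited less often.
states = np.array([[0.0], [1.0], [2.0], [3.0]])
counts = np.array([50, 20, 5, 1])
goals = select_frontier_goals(states, counts, n_goals=2)
print(goals.ravel())  # the two least-visited states: [3. 2.]
```

In practice, methods in this family replace raw visit counts with learned uncertainty or density estimates, but the selection logic is the same.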
Although some prior works propose modifying the curriculum distribution into a uniform one over the feasible state space (Pong et al., 2019; Klink et al., 2022) or generating a curriculum based on the level of difficulty (Florensa et al., 2018; Sukhbaatar et al., 2017), most of these methods make slow curriculum progress: they either skew the curriculum distribution toward a uniform one rather than toward the frontier of the explored region, or they are prone to focusing on infeasible goals, where the agent's capability stagnates at an intermediate level of difficulty.
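The difficulty-based curricula referenced here typically keep only goals with an intermediate empirical success rate. A hedged sketch of that filter (in the spirit of Florensa et al., 2018; the thresholds and names are illustrative assumptions, not the cited implementation):

```python
import numpy as np

def goals_of_intermediate_difficulty(goals, success_rates, r_min=0.1, r_max=0.9):
    """Keep only goals whose empirical success rate lies in (r_min, r_max).

    Goals the agent always solves (too easy) or always fails (too hard,
    possibly infeasible) are filtered out. The caveat raised in the text:
    an infeasible goal can still pass this filter if noisy rollouts give
    it an intermediate success rate, stalling curriculum progress.
    """
    mask = (success_rates > r_min) & (success_rates < r_max)
    return goals[mask]

# Toy example: four candidate goals with measured success rates.
goals = np.array([[0.5], [1.5], [2.5], [3.5]])
rates = np.array([1.0, 0.6, 0.3, 0.0])
print(goals_of_intermediate_difficulty(goals, rates).ravel())  # [1.5 2.5]
```

The choice of `r_min`/`r_max` is a tuning decision; too wide a band admits near-infeasible goals, too narrow a band starves the curriculum.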
Feb-20-2023