Student-Informed Teacher Training
Nico Messikommer, Jiaxu Xing, Elie Aljalbout, Davide Scaramuzza
arXiv.org Artificial Intelligence
Figure caption: Our method leverages three networks (a), which are trained in three alternating phases: the roll-out phase (b), the policy update phase (c), and the alignment phase (d). The grey boxes represent networks that are frozen during the respective phase, and the dashed arrows indicate the gradient flow.

Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images. In this framework, a teacher is trained with privileged task information, while a student tries to predict the teacher's actions from more limited observations. For example, in a robot navigation task, the teacher might have access to distances to nearby obstacles, while the student only receives visual observations of the scene. However, privileged imitation learning faces a key challenge: the student might be unable to imitate the teacher's behavior due to partial observability. This problem arises because the teacher is trained without considering whether the student is capable of imitating the learned behavior. To address this teacher-student asymmetry, we propose a framework for joint training of the teacher and student policies, encouraging the teacher to learn behaviors that the student can imitate despite the latter's limited access to information and its partial observability. Based on the performance bound in imitation learning, we add (i) the approximated action difference between teacher and student as a penalty term to the reward function of the teacher, and (ii) a supervised teacher-student alignment step. We motivate our method with a maze navigation task and demonstrate its effectiveness on complex vision-based quadrotor flight and manipulation tasks.

In reinforcement learning (RL), an agent learns to perform a task by interacting with its environment and maximizing the cumulative rewards gained through these interactions. However, this process requires extensive exploration, as the agent must avoid getting trapped in local minima, which often results in a large number of environment interactions (Pathak et al., 2017). The number of interactions increases further when the agent processes high-dimensional data as input (Ota et al., 2020). From such observations, the policy must learn to extract a notion of the agent's state, a process that is computationally expensive when optimized solely through RL.

This work was supported by the European Research Council (ERC) under grant agreement No. 864042 (AGILEFLIGHT).
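The reward penalty mentioned in the abstract, item (i), can be illustrated with a minimal sketch. The excerpt does not give the exact form of the penalty, so everything below is an assumption made for illustration: the function names `action_gap` and `shaped_teacher_reward`, the Euclidean distance as the measure of action difference, and the weight `penalty_weight` are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def action_gap(teacher_action: Sequence[float],
               student_action: Sequence[float]) -> float:
    """Euclidean distance between two action vectors.

    Assumption: the excerpt only says "approximated action difference";
    the Euclidean norm is used here purely for illustration.
    """
    if len(teacher_action) != len(student_action):
        raise ValueError("action vectors must have the same length")
    return math.sqrt(sum((t - s) ** 2
                         for t, s in zip(teacher_action, student_action)))


def shaped_teacher_reward(task_reward: float,
                          teacher_action: Sequence[float],
                          student_action: Sequence[float],
                          penalty_weight: float = 0.1) -> float:
    """Task reward minus a weighted teacher-student action gap.

    `penalty_weight` is a hypothetical hyperparameter: a larger value
    pushes the teacher more strongly toward actions the student
    can reproduce.
    """
    gap = action_gap(teacher_action, student_action)
    return task_reward - penalty_weight * gap


if __name__ == "__main__":
    # The teacher picks an action the student cannot match well,
    # so the teacher's reward is reduced.
    print(shaped_teacher_reward(1.0, [3.0, 4.0], [0.0, 0.0]))
    # Identical actions leave the task reward unchanged.
    print(shaped_teacher_reward(1.0, [0.5, 0.5], [0.5, 0.5]))
```

In this sketch the teacher keeps its full task reward only when the student's predicted action matches its own, which reflects the stated goal of steering the teacher toward behaviors the student can imitate.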
arXiv.org Artificial Intelligence
Dec-12-2024