When intelligent agents learn visuomotor behaviors from human demonstrations, they may benefit from knowing where the human is allocating visual attention, which can be inferred from their gaze. A wealth of information regarding intelligent decision making is conveyed by human gaze allocation; hence, exploiting such information has the potential to improve the agents' performance. With this motivation, we propose the AGIL (Attention Guided Imitation Learning) framework. We collect high-quality human action and gaze data while human subjects play Atari games in a carefully controlled experimental setting. Using these data, we first train a deep neural network that can predict human gaze positions and visual attention with high accuracy (the gaze network) and then train another network to predict human actions (the policy network). Incorporating the learned attention model from the gaze network into the policy network significantly improves the action prediction accuracy and task performance.
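One simple way to incorporate a learned attention model into a policy network, consistent with the two-stage pipeline the abstract describes, is to mask the input frame with the predicted gaze map and feed both the raw and masked frames to the policy. The sketch below is a minimal illustration of that idea with NumPy; the function names and the uniform placeholder attention map are assumptions for demonstration, not the authors' actual implementation.

```python
import numpy as np

def gaze_network(frame):
    """Hypothetical stand-in for the trained gaze network: returns a
    normalized attention map over the frame (here, simply uniform)."""
    h, w = frame.shape
    return np.ones((h, w)) / (h * w)

def attention_masked_input(frame, attn):
    """Combine the raw frame with the predicted attention map by
    element-wise multiplication, so attended regions are emphasized,
    then stack raw and masked frames as two input channels."""
    masked = frame * (attn / attn.max())  # rescale so peak attention = 1
    return np.stack([frame, masked])

frame = np.random.rand(84, 84)            # Atari-style grayscale frame
attn = gaze_network(frame)
policy_input = attention_masked_input(frame, attn)  # shape (2, 84, 84)
```

The stacked array would then be passed to whatever convolutional policy network predicts the human action; keeping the unmasked channel preserves information outside the attended region.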
We propose a model-free deep reinforcement learning method that leverages a small amount of demonstration data to assist a reinforcement learning agent. We apply this approach to robotic manipulation tasks and train end-to-end visuomotor policies that map directly from RGB camera inputs to joint velocities. We demonstrate that our approach can solve a wide variety of visuomotor tasks for which engineering a scripted controller would be laborious. Our experiments indicate that our combined reinforcement and imitation agent achieves significantly better performance than agents trained with reinforcement learning or imitation learning alone. We also illustrate that these policies, trained with large visual and dynamics variations, can achieve preliminary successes in zero-shot sim2real transfer. A brief visual description of this work can be viewed at https://youtu.be/EDl8SQUNjj0
For their study, researchers from New York University Shanghai and the University of Hong Kong had 80 students and faculty from the University of Hong Kong participate in several experiments involving different video games. Action-based video games, for example, force the gamer to respond to visual cues. Think driving-centric games, like "Mario Kart," or first-person shooter games, such as "Unreal Tournament." Non-action games, on the other hand, include those like "Sims 2" and "Roller Coaster Tycoon," where the gamer is responsible for directing the action. In one experiment, subjects with no action-based video game experience were asked to play "Mario Kart" or a first-person shooter game.
Adaptation of visually guided reaching movements in novel visuomotor environments (e.g. wearing prism goggles) comprises not only motor adaptation but also substantial sensory adaptation, corresponding to shifts in the perceived spatial location of visual and proprioceptive cues. Previous computational models of the sensory component of visuomotor adaptation have assumed that it is driven purely by the discrepancy introduced between visual and proprioceptive estimates of hand position and is independent of any motor component of adaptation. We instead propose a unified model in which sensory and motor adaptation are jointly driven by optimal Bayesian estimation of the sensory and motor contributions to perceived errors. Our model is able to account for patterns of performance errors during visuomotor adaptation as well as the subsequent perceptual aftereffects. This unified model also makes the surprising prediction that force field adaptation will elicit similar perceptual shifts, even though there is never any discrepancy between visual and proprioceptive observations. We confirm this prediction with an experiment.
Zhang, Luxin (Peking University) | Zhang, Ruohan (The University of Texas at Austin) | Liu, Zhuode (The University of Texas at Austin) | Hayhoe, Mary M. (The University of Texas at Austin) | Ballard, Dana H. (The University of Texas at Austin)
A wealth of information regarding intelligent decision making is conveyed by human gaze and visual attention; hence, modeling and exploiting such information might be a promising way to strengthen algorithms like deep reinforcement learning. We collect high-quality human action and gaze data while human subjects play Atari games. Using these data, we train a deep neural network that can predict human gaze positions and visual attention with high accuracy.