Being able to navigate to a target with minimal supervision and prior knowledge is critical to creating human-like assistive agents. Prior work on map-based and map-less approaches have limited generalizability. In this paper, we present a novel approach, Hybrid Asynchronous Universal Successor Representations (HAUSR), which overcomes the problem of generalizability to new goals by adapting recent work on Universal Successor Representations with Asynchronous Actor-Critic Agents. We show that the agent was able to successfully reach novel goals and we were able to quickly fine-tune the network for adapting to new scenes. This opens up novel application scenarios where intelligent agents could learn from and adapt to a wide range of environments with minimal human input.
A major challenge in reinforcement learning (RL) is the design of agents that are able to generalize across tasks that share common dynamics. A viable solution is meta-reinforcement learning, which identifies common structures among past tasks to be then generalized to new tasks (meta-test). In meta-training, the RL agent learns state representations that encode prior information from a set of tasks, used to generalize the value function approximation. This has been proposed in the literature as successor representation approximators. While promising, these methods do not generalize well across optimal policies, leading to sampling-inefficiency during meta-test phases. In this paper, we propose state2vec, an efficient and low-complexity framework for learning successor features which (i) generalize across policies, (ii) ensure sample-efficiency during meta-test. We extend the well known node2vec framework to learn state embeddings that account for the discounted future state transitions in RL. The proposed off-policy state2vec captures the geometry of the underlying state space, making good basis functions for linear value function approximation.
Animals need to devise strategies to maximize returns while interacting with their environment based on incoming noisy sensory observations. Task-relevant states, such as the agent's location within an environment or the presence of a predator, are often not directly observable but must be inferred using available sensory information. Successor representations (SR) have been proposed as a middle-ground between model-based and model-free reinforcement learning strategies, allowing for fast value computation and rapid adaptation to changes in the reward function or goal locations. Indeed, recent studies suggest that features of neural responses are consistent with the SR framework. However, it is not clear how such representations might be learned and computed in partially observed, noisy environments. Here, we introduce a neurally plausible model using distributional successor features, which builds on the distributed distributional code for the representation and computation of uncertainty, and which allows for efficient value function computation in partially observed environments via the successor representation. We show that distributional successor features can support reinforcement learning in noisy environments in which direct learning of successful policies is infeasible.
One question central to Reinforcement Learning is how to learn a feature representation that supports algorithm scaling and re-use of learned information from different tasks. Successor Features approach this problem by learning a feature representation that satisfies a temporal constraint. We present an implementation of an approach that decouples the feature representation from the reward function, making it suitable for transferring knowledge between domains. We then assess the advantages and limitations of using Successor Features for transfer.
Transfer in reinforcement learning refers to the notion that generalization should occur not only within a task but also across tasks. We propose a transfer framework for the scenario where the reward function changes between tasks but the environment's dynamics remain the same. Our approach rests on two key ideas: "successor features", a value function representation that decouples the dynamics of the environment from the rewards, and "generalized policy improvement", a generalization of dynamic programming's policy improvement operation that considers a set of policies rather than a single one. Put together, the two ideas lead to an approach that integrates seamlessly within the reinforcement learning framework and allows the free exchange of information across tasks. The proposed method also provides performance guarantees for the transferred policy even before any learning has taken place. We derive two theorems that set our approach in firm theoretical ground and present experiments that show that it successfully promotes transfer in practice, significantly outperforming alternative methods in a sequence of navigation tasks and in the control of a simulated robotic arm.