temporal abstraction
Strategic Attentive Writer for Learning Macro-Actions
Alexander Vezhnevets, Volodymyr Mnih, Simon Osindero, Alex Graves, Oriol Vinyals, John Agapiou, Koray Kavukcuoglu
We present a novel deep recurrent neural network architecture that learns to build implicit plans in an end-to-end manner purely by interacting with an environment in a reinforcement learning setting. The network builds an internal plan, which is continuously updated upon observation of the next input from the environment. It can also partition this internal representation into contiguous sub-sequences by learning for how long the plan can be committed to, i.e. followed without replanning. Combining these properties, the proposed model, dubbed STRategic Attentive Writer (STRAW), can learn high-level, temporally abstracted macro-actions of varying lengths that are learnt solely from data without any prior information. These macro-actions enable both structured exploration and economic computation. We experimentally demonstrate that STRAW delivers strong improvements on several ATARI games by employing temporally extended planning strategies (e.g.
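The idea of committing to a plan for a learned number of steps can be illustrated with a minimal sketch. This is not the STRAW architecture: there is no neural network or attentive writing here, and `make_plan`, `step`, `toy_planner`, and `toy_step` are hypothetical names introduced only for illustration. The sketch assumes a planner that returns a sequence of primitive actions together with a commitment length, and shows that the planner is consulted only when the current commitment expires.

```python
from typing import Callable, List, Tuple


def follow_plans(
    make_plan: Callable[[int], Tuple[List[int], int]],
    step: Callable[[int, int], int],
    obs: int,
    horizon: int,
) -> Tuple[List[int], int]:
    """Roll out `horizon` steps, replanning only when a commitment expires.

    make_plan(obs) -> (actions, commit_len) is a hypothetical planner.
    step(obs, action) -> next_obs is a hypothetical environment transition.
    Returns the executed actions and the number of planner calls.
    """
    executed: List[int] = []
    plan: List[int] = []
    remaining = 0  # steps left in the current commitment
    planner_calls = 0
    while len(executed) < horizon:
        if remaining == 0:
            plan, commit_len = make_plan(obs)
            if not plan:
                raise ValueError("planner returned an empty plan")
            plan = list(plan)  # copy so the caller's list is not mutated
            remaining = max(1, min(commit_len, len(plan)))
            planner_calls += 1
        action = plan.pop(0)  # follow the plan without replanning
        remaining -= 1
        obs = step(obs, action)
        executed.append(action)
    return executed, planner_calls


def toy_planner(obs: int) -> Tuple[List[int], int]:
    # Toy assumption: repeat one action three times and commit to all three.
    return [obs % 2] * 3, 3


def toy_step(obs: int, action: int) -> int:
    # Toy assumption: the observation is just a step counter.
    return obs + 1


actions, calls = follow_plans(toy_planner, toy_step, obs=0, horizon=7)
print(actions, calls)
```

In this toy run the planner is called three times for seven environment steps, which is the sense in which committing to macro-actions saves computation.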
Regret Minimization in MDPs with Options without Prior Knowledge
Recent work has leveraged the mapping of Markov decision processes (MDPs) with options to semi-MDPs (SMDPs) and introduced SMDP versions of exploration-exploitation algorithms (e.g., RMAX-SMDP and UCRL-SMDP) to analyze the impact of options on learning performance. Nonetheless, the PAC-SMDP sample complexity of RMAX-SMDP can hardly be translated into equivalent PAC-MDP theoretical guarantees, while UCRL-SMDP requires prior knowledge of the parameters characterizing the distributions of the cumulative reward and duration of each option, which are hardly available in practice. In this paper, we remove this limitation by combining the SMDP view with the inner Markov structure of options into a novel algorithm whose regret performance matches UCRL-SMDP's up to an additive regret term. We show scenarios where this term is negligible and the advantage of temporal abstraction is preserved. We also report preliminary empirical results supporting the theoretical findings.
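To make the two per-option quantities concrete, the following minimal sketch executes a single option and records its cumulative reward and duration, which is what the SMDP view observes for one high-level decision. It is not the algorithm from the paper; `run_option`, `policy`, `terminates`, `step`, and the toy chain environment are hypothetical names and assumptions introduced for illustration.

```python
from typing import Callable, Tuple


def run_option(
    policy: Callable[[int], int],
    terminates: Callable[[int], bool],
    step: Callable[[int, int], Tuple[int, float]],
    state: int,
    max_steps: int = 1000,
) -> Tuple[int, float, int]:
    """Execute one option until it terminates.

    Returns (final_state, cumulative_reward, duration), i.e. a single
    SMDP-level transition sample for this option.
    """
    total_reward = 0.0
    duration = 0
    while duration < max_steps:
        state, reward = step(state, policy(state))
        total_reward += reward
        duration += 1
        if terminates(state):
            break
    return state, total_reward, duration


# Toy assumption: a chain of integer states, a cost of 1 per primitive step,
# and an option that walks right until it reaches a multiple of 5.
def chain_step(state: int, action: int) -> Tuple[int, float]:
    return state + action, -1.0


sample = run_option(
    policy=lambda s: 1,
    terminates=lambda s: s % 5 == 0,
    step=chain_step,
    state=1,
)
print(sample)
```

Repeating such executions from the same start state would give empirical samples of the reward and duration distributions that UCRL-SMDP is said to need prior knowledge about.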
Variational Temporal Abstraction
Taesup Kim, Sungjin Ahn, Yoshua Bengio
There have been approaches to learn such hierarchical structure in sequences such as the HMRNN (Chung et al., 2016). However, as a deterministic model, it has the main limitation that it cannot capture the stochastic nature prevailing in the data. In particular, this is a critical limitation to imagination-augmented agents because exploring various possible futures according to the uncertainty is what makes the imagination meaningful in many cases.
Ranking Policy Decisions
In a run with n timesteps, a policy will make n decisions on actions to take; we conjecture that only a small subset of these decisions delivers value over selecting a simple default action. Given a trained policy, we propose a novel black-box method based on statistical fault localisation that ranks the states of the environment according to the importance of decisions made in those states. We argue that among other things, the ranked list of states can help explain and understand the policy. As the ranking method is statistical, a direct evaluation of its quality is hard.
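One way such a ranking could be computed is sketched below. The abstract does not say which fault-localisation measure is used, so this sketch assumes a Tarantula-style score, and it assumes that each run records the set of states in which the policy's decision was replaced by the default action together with whether the run still succeeded. The function name `rank_states` and the toy data are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple

Run = Tuple[Set[Hashable], bool]  # (states where the default was used, run passed?)


def rank_states(runs: Sequence[Run]) -> List[Tuple[Hashable, float]]:
    """Rank states by a Tarantula-style score (an assumed choice of measure).

    A state scores high when runs that replaced its decision with the
    default action tend to fail, suggesting the decision there matters.
    """
    total_pass = sum(1 for _, passed in runs if passed)
    total_fail = len(runs) - total_pass
    fail_count: Dict[Hashable, int] = {}
    pass_count: Dict[Hashable, int] = {}
    for replaced, passed in runs:
        counts = pass_count if passed else fail_count
        for state in replaced:
            counts[state] = counts.get(state, 0) + 1

    scores: Dict[Hashable, float] = {}
    for state in set(fail_count) | set(pass_count):
        fail_rate = fail_count.get(state, 0) / total_fail if total_fail else 0.0
        pass_rate = pass_count.get(state, 0) / total_pass if total_pass else 0.0
        denom = fail_rate + pass_rate
        scores[state] = fail_rate / denom if denom else 0.0
    # Highest score first; ties broken by state name for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0])))


# Toy data: replacing the decision in "a" always coincides with failure,
# replacing the one in "c" never does.
toy_runs: List[Run] = [
    ({"a", "b"}, False),
    ({"a"}, False),
    ({"b", "c"}, True),
    ({"c"}, True),
]
ranking = rank_states(toy_runs)
print(ranking)
```

The method only needs run outcomes, not the policy's internals, which is consistent with the black-box framing in the abstract.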