 Reinforcement Learning






12ffb0968f2f56e51a59a6beb37b2859-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their insights and suggestions. Answers below will be included in expanded discussions in future versions of the paper. In the case of R3's car example, as long as states from 10 steps into the future are sampled This is discussed in L211-L215 in Section 6 "Practical Training of γ-Models". The only Monte Carlo trajectory estimates are in the final column for comparison.



Discovery of Useful Questions as Auxiliary Tasks

Neural Information Processing Systems

Arguably, intelligent agents ought to be able to discover their own questions so that in learning answers for them they learn unanticipated useful knowledge and skills; this departs from the focus in much of machine learning on agents learning answers to externally defined questions. We present a novel method for a reinforcement learning (RL) agent to discover questions formulated as general value functions or GVFs, a fairly rich form of knowledge representation. Specifically, our method uses non-myopic meta-gradients to learn GVF-questions such that learning answers to them, as an auxiliary task, induces useful representations for the main task faced by the RL agent. We demonstrate that auxiliary tasks based on the discovered GVFs are sufficient, on their own, to build representations that support main task learning, and that they do so better than popular hand-designed auxiliary tasks from the literature. Furthermore, we show, in the context of Atari 2600 videogames, how such auxiliary tasks, meta-learned alongside the main task, can improve the data efficiency of an actor-critic agent.
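To make the notion of a GVF "question" concrete, here is a minimal sketch, assuming the standard formulation in which a question is specified by a cumulant signal and a per-step discount, and the "answer" is the expected discounted sum of cumulants. The function name `gvf_returns` and its arguments are hypothetical and chosen for illustration; this computes sampled return targets along one trajectory and is not the meta-gradient discovery method described in the abstract.

```python
from typing import List, Sequence


def gvf_returns(
    cumulants: Sequence[float],
    discounts: Sequence[float],
    bootstrap: float = 0.0,
) -> List[float]:
    """Sampled GVF returns along one trajectory (illustrative helper).

    cumulants[t] and discounts[t] are the cumulant and discount observed
    after step t. The return follows the recursion
        G_t = cumulants[t] + discounts[t] * G_{t+1},
    with G at the end of the trajectory set to `bootstrap`.
    """
    if len(cumulants) != len(discounts):
        raise ValueError("cumulants and discounts must have the same length")
    returns = [0.0] * len(cumulants)
    g = bootstrap
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(cumulants))):
        g = cumulants[t] + discounts[t] * g
        returns[t] = g
    return returns


if __name__ == "__main__":
    print(gvf_returns([1.0, 0.0, 2.0], [0.5, 0.5, 0.0]))
```

Setting the cumulant to the task reward and the discount to a constant recovers an ordinary value function; other choices of cumulant and discount pose different predictive questions about the same data stream.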



Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning

Neural Information Processing Systems

Recently, there has been significant progress in understanding reinforcement learning in discounted infinite-horizon Markov decision processes (MDPs) by deriving tight sample complexity bounds. However, in many real-world applications, an interactive learning agent operates for a fixed or bounded period of time, for example tutoring students for exams or handling customer service requests. Such scenarios can often be better treated as episodic fixed-horizon MDPs, for which only looser bounds on the sample complexity exist. A natural notion of sample complexity in this setting is the number of episodes required to guarantee a certain performance with high probability (PAC guarantee).
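One common way to formalize such a PAC guarantee is sketched below. The notation is an assumption made for illustration and is not taken from the abstract: $\pi_k$ is the policy followed in episode $k$, $V^{\pi}(s_0)$ is the expected total reward of $\pi$ over the fixed horizon from a start state $s_0$, $V^{*}$ is the optimal value, and $N(\epsilon, \delta)$ is the sample complexity bound.

```latex
\Pr\!\left[\; \sum_{k=1}^{\infty}
  \mathbb{1}\!\left\{ V^{\pi_k}(s_0) < V^{*}(s_0) - \epsilon \right\}
  \;\le\; N(\epsilon, \delta) \;\right] \;\ge\; 1 - \delta
```

Read this as: with probability at least $1 - \delta$, the agent follows a policy that is worse than $\epsilon$-optimal in at most $N(\epsilon, \delta)$ episodes.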


Learning to Dispatch for Job Shop Scheduling via Deep Reinforcement Learning (Cong Zhang, Wen Song, ...)

Neural Information Processing Systems

In the paper, we adopt the Proximal Policy Optimization (PPO) algorithm [36] to train our agent. Here we provide details of our algorithm in terms of pseudocode, as shown in Algorithm 1. Similar ... In this section, we show how the baseline PDRs compute the priority index for the operations. Here we present the complete results on Taillard's benchmark. In Table S.1, we report the results of ... In Table S.2, we report the generalization performance of our policies trained on ... The "UB" column is the best solution from ... Similar conclusions can be drawn from the results on the DMU benchmark. In Table S.3, we report results ... In Table S.4, which focuses on ... The "UB" column is the best solution from ... We show training curves for all problems in Figure 1.
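For readers unfamiliar with PPO, the following is a minimal sketch of the standard clipped surrogate objective that the algorithm maximizes. It is not the paper's Algorithm 1, which is not reproduced in this excerpt; the function name `ppo_clip_objective`, its arguments, and the default `clip_eps=0.2` are assumptions made for illustration.

```python
import math
from typing import Sequence


def ppo_clip_objective(
    new_logp: Sequence[float],
    old_logp: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean clipped surrogate objective over a batch (illustrative helper).

    new_logp / old_logp are log-probabilities of the taken actions under
    the current policy and the data-collecting policy, respectively.
    """
    if not (len(new_logp) == len(old_logp) == len(advantages)):
        raise ValueError("all inputs must have the same length")
    if len(advantages) == 0:
        raise ValueError("batch must not be empty")
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    print(ppo_clip_objective([math.log(2.0)], [0.0], [1.0]))
```

The clipping removes any incentive to move the probability ratio outside the interval [1 - clip_eps, 1 + clip_eps] in the direction that would increase the objective, which keeps each policy update close to the policy that collected the data.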