Reinforcement Learning
Learning to Dispatch for Job Shop Scheduling via Deep Reinforcement Learning
Cong Zhang, Wen Song
In the paper, we adopt the Proximal Policy Optimization (PPO) algorithm [36] to train our agent. Here we provide details of our algorithm in pseudocode, as shown in Algorithm 1. Similar …

In this section, we show how the baseline PDRs compute the priority index for the operations.

Here we present the complete results on Taillard's benchmark. In Table S.1, we report the results of … In Table S.2, we report the generalization performance of our policies trained on … The "UB" column is the best solution from …

A similar conclusion can be drawn from the results on the DMU benchmark. In Table S.3, we report results … In Table S.4, which focuses on … The "UB" column is the best solution from …

We show training curves for all problems in Figure 1.
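Algorithm 1 itself is not reproduced in this excerpt. For orientation on the PPO training mentioned above, the snippet below computes the standard PPO clipped surrogate objective for a batch of sampled actions. It is a minimal plain-Python sketch that assumes per-sample log-probabilities under the new and old policies and precomputed advantage estimates. The function name `ppo_clip_objective` and the default `clip_eps=0.2` are illustrative choices, not the authors' code or settings.

```python
import math
from typing import Sequence


def ppo_clip_objective(
    new_logp: Sequence[float],
    old_logp: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,  # illustrative default, not taken from the paper
) -> float:
    """Mean PPO clipped surrogate objective (to be maximized).

    Hypothetical helper: the probability ratio pi_new(a|s) / pi_old(a|s)
    is clipped to [1 - clip_eps, 1 + clip_eps], and the pessimistic
    (smaller) of the clipped and unclipped terms is kept per sample.
    """
    if not (len(new_logp) == len(old_logp) == len(advantages)):
        raise ValueError("inputs must have the same length")
    if not advantages:
        raise ValueError("need at least one sample")
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # Ratio 2 with a positive advantage: the gain is capped at 1 + clip_eps.
    print(ppo_clip_objective([math.log(2.0)], [0.0], [1.0]))
```

In an actual trainer, the negative of this quantity would be minimized with an autodiff framework, typically together with value-function and entropy terms. The pure-Python version only illustrates how clipping removes the incentive to move the policy ratio far from 1.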
In Appendix A, we introduce some basic definitions needed for our theoretical results. In Appendix B, we provide sufficient conditions for Assumption 1 from the main text. In Appendix C and Appendix D, we prove the error bounds for PPI and PQI.

… All other dynamics are preserved. Rewards are 0 for the absorbing action and unchanged elsewhere.

… Algorithms 1 and 2. As some of the notation is actually a function of the MDP, we clarify its usage …

Recall the definition of the semi-norm of a function of state-action pairs: …
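The fragment above describes a modified MDP in which an absorbing action is added, its reward is 0, and all other rewards and dynamics are left unchanged. The full construction is not visible in this excerpt. The sketch below shows one way such an augmentation could look for a tabular MDP. Several details are assumptions, not taken from the source: the dictionary representation, the hypothetical names `add_absorbing_action`, `a_absorb`, and `s_absorb`, and the choice that the absorbing action leads deterministically to a new self-looping absorbing state.

```python
from typing import Dict, List, Tuple

# Hypothetical tabular representation (an assumption, not from the source):
# transitions[(s, a)] is a list of (next_state, probability) pairs,
# rewards[(s, a)] is the expected immediate reward.
Transitions = Dict[Tuple[str, str], List[Tuple[str, float]]]
Rewards = Dict[Tuple[str, str], float]


def add_absorbing_action(
    states: List[str],
    actions: List[str],
    transitions: Transitions,
    rewards: Rewards,
    absorb_action: str = "a_absorb",
    absorb_state: str = "s_absorb",
) -> Tuple[List[str], List[str], Transitions, Rewards]:
    """Return a copy of the MDP with an extra zero-reward absorbing action."""
    new_states = list(states) + [absorb_state]
    new_actions = list(actions) + [absorb_action]
    # Copy so the original dynamics and rewards are preserved untouched.
    new_t: Transitions = {k: list(v) for k, v in transitions.items()}
    new_r: Rewards = dict(rewards)
    # Assumed construction: from any state, the absorbing action moves
    # deterministically to the absorbing state and earns reward 0.
    for s in new_states:
        new_t[(s, absorb_action)] = [(absorb_state, 1.0)]
        new_r[(s, absorb_action)] = 0.0
    # Assumed: the absorbing state self-loops under every original action
    # with reward 0, so once entered, no further reward is collected.
    for a in actions:
        new_t[(absorb_state, a)] = [(absorb_state, 1.0)]
        new_r[(absorb_state, a)] = 0.0
    return new_states, new_actions, new_t, new_r
```

With this construction, taking the absorbing action from any state yields a return of exactly 0. Every original state-action pair keeps its original transitions and rewards, which matches the "unchanged elsewhere" and "all other dynamics are preserved" conditions stated above.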