Supplementary Policy

Neural Information Processing Systems 

Let $\Delta_t(s, a) = Q(s, a) - \hat{Q}(s, a)$ and $F_t(s, a) = r_{\text{peer}} + \gamma \max_{b \in \mathcal{A}} Q(s', b) - \hat{Q}(s, a)$. In (A4), we consider the robust DQN algorithm with peer sampling, in which the original loss is $\ell((s, a), y)$, also calibrated.
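
As a rough illustration of the two quantities defined above, the following tabular sketch evaluates them for a single transition. It assumes that both are differences (the minus signs are not legible in the source), that the max term is scaled by a discount factor `gamma` as in standard Q-learning, and that `Q` and `Q_hat` are stored as state-by-action arrays. The function names and the numbers are hypothetical and are not taken from the paper.

```python
import numpy as np


def delta(Q, Q_hat, s, a):
    """Gap between the current estimate Q and the reference Q_hat at (s, a)."""
    return Q[s, a] - Q_hat[s, a]


def f_term(Q, Q_hat, s, a, r_peer, s_next, gamma):
    """Peer-reward Bellman target at (s, a) minus the reference value.

    gamma is an assumed discount factor; it is not legible in the source.
    """
    return r_peer + gamma * np.max(Q[s_next]) - Q_hat[s, a]


# Hypothetical 2-state, 2-action tables, for illustration only.
Q = np.array([[1.0, 2.0],
              [0.5, 3.0]])
Q_hat = np.array([[1.5, 1.0],
                  [0.0, 2.0]])

d = delta(Q, Q_hat, s=0, a=1)
f = f_term(Q, Q_hat, s=0, a=1, r_peer=0.5, s_next=1, gamma=0.5)
print(d, f)
```

With these made-up values, the gap at state 0 and action 1 is 2.0 - 1.0, and the target term is 0.5 + 0.5 * 3.0 - 1.0.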