
Baselines

Neural Information Processing Systems

As shown in the main text, under the assumption that the influence network is unbiased, our factor baselines are valid control variates. We prove this result below, repeating the statement itself for completeness and providing a supplementary lemma on control variates as a restatement of known results. Let $X$, $Y$ and $Z$ be random variables, where the law of $X$ conditional on $Z$ is denoted $P_\theta(X \mid Z)$, and $Y$ is independent of $X$ conditioned on $Z$. Then we have that $\mathbb{E}[Y \nabla_\theta \ln P_\theta(X \mid Z)] = 0$. Proof. Factor baselines are valid control variates if $G_\Sigma$ is true to the MDP (i.e.
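The lemma can be checked numerically on a toy model. The sketch below is illustrative only and is not taken from the paper: the Bernoulli model, the parameter `theta`, the reward-like variable `y`, and the function name `estimate_score_products` are all assumptions made for this example. It estimates $\mathbb{E}[Y \nabla_\theta \ln P_\theta(X \mid Z)]$ when $Y$ depends on $Z$ only, and contrasts it with the case where the conditional-independence assumption is violated by using $X$ itself in place of $Y$.

```python
import math
import random


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def estimate_score_products(theta=0.3, n_samples=200_000, seed=0):
    """Monte Carlo estimates of E[Y * score] and E[X * score].

    Hypothetical toy model (an assumption, not the paper's setting):
      Z ~ Bernoulli(0.5)
      X | Z ~ Bernoulli(sigmoid(theta + Z))
      Y = 3 Z + 1 + Gaussian noise, so Y is independent of X given Z.
    """
    rng = random.Random(seed)
    sum_indep = 0.0  # accumulates Y * score (assumption of the lemma holds)
    sum_dep = 0.0    # accumulates X * score (assumption violated)
    for _ in range(n_samples):
        z = 1.0 if rng.random() < 0.5 else 0.0
        p = sigmoid(theta + z)            # P_theta(X = 1 | Z = z)
        x = 1.0 if rng.random() < p else 0.0
        score = x - p                     # d/dtheta ln P_theta(x | z)
        y = 3.0 * z + 1.0 + rng.gauss(0.0, 1.0)
        sum_indep += y * score
        sum_dep += x * score
    return sum_indep / n_samples, sum_dep / n_samples


if __name__ == "__main__":
    indep, dep = estimate_score_products()
    print(f"E[Y * score] ~ {indep:.4f}")  # should be close to zero
    print(f"E[X * score] ~ {dep:.4f}")    # should be clearly positive
```

In this toy model the first estimate should sit within Monte Carlo noise of zero, while the second should be close to $\mathbb{E}[p(1-p)]$, which is strictly positive. This is the sense in which a term that is conditionally independent of the sampled variable can be subtracted without biasing a score-function estimator.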


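$$
\begin{aligned}
\cdots\, \langle z_i, z_i \rangle \big] &+ \big\langle \mathbb{E}_a[\hat{V}_i z_i],\ \mathbb{E}_a[\hat{V}_i z_i] \big\rangle, \\
= \mathbb{E}_a\big[ \hat{V}_i^2 \langle z_i, z_i \rangle \big] &+ \big\langle \mathbb{E}_{\sigma_{\pi_i}(a)}\big[ \hat{V}_i\, \mathbb{E}_{\sigma_{\pi_i}(a)}[z_i] \big],\ \mathbb{E}_{\sigma_{\pi_i}(a)}\big[ \hat{V}_i\, \mathbb{E}_{\sigma_{\pi_i}(a)}[z_i] \big] \big\rangle, \\
= \mathbb{E}_a\big[ \cdots
\end{aligned}
$$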


$\mathrm{Cov}[g_i(s,a), g_j(s,a)]$. (9) The $n$ optimal baselines are given by the values that minimise Equation 9; i.e. $b^\star_i(s, \sigma_{\pi_i}(a))$. Note that while $y_i$ depends on the full action, $x_i$ depends only on the actions influencing the targets in $[K_\Sigma \psi(s,a)]_i$. In general, there are very few methods that can solve these types of systems, and those that can are limited to bounds on $d|\Sigma|$ of approximately 20. These explore the impact of the factor baseline across a set of dimensionalities and learning rates. This implies that the performance observed in the search bandit is likely to tell us about the performance in full MDPs.
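To make the idea of a variance-minimising baseline concrete, the following sketch uses the classical scalar result for a single score-function estimator, $b^\star = \mathbb{E}[z^2 r] / \mathbb{E}[z^2]$, rather than the factored system of Equation 9. Everything in it is an assumption for illustration: the one-parameter Bernoulli policy, the reward model, and the name `baseline_variance_demo` do not come from the paper.

```python
import math
import random


def baseline_variance_demo(theta=0.4, n_samples=100_000, seed=0):
    """Compare a score-function gradient estimator with and without a baseline.

    Hypothetical one-step bandit (an assumption, not the paper's search bandit):
      a ~ Bernoulli(sigmoid(theta)), reward r = 10 + 2a + Gaussian noise.
    """
    rng = random.Random(seed)
    p = 1.0 / (1.0 + math.exp(-theta))   # P(a = 1) under the toy policy
    zs, rs = [], []
    for _ in range(n_samples):
        a = 1.0 if rng.random() < p else 0.0
        zs.append(a - p)                                 # score d/dtheta ln pi(a)
        rs.append(10.0 + 2.0 * a + rng.gauss(0.0, 1.0))  # illustrative reward

    # Classical variance-minimising scalar baseline: E[z^2 r] / E[z^2].
    b_opt = sum(z * z * r for z, r in zip(zs, rs)) / sum(z * z for z in zs)

    def mean_and_var(b):
        # Per-sample gradient estimates g = z * (r - b) and their statistics.
        g = [z * (r - b) for z, r in zip(zs, rs)]
        m = sum(g) / len(g)
        v = sum((gi - m) ** 2 for gi in g) / len(g)
        return m, v

    return b_opt, mean_and_var(0.0), mean_and_var(b_opt)


if __name__ == "__main__":
    b_opt, (m0, v0), (m1, v1) = baseline_variance_demo()
    print(f"baseline: {b_opt:.3f}")
    print(f"no baseline:   mean {m0:.3f}, variance {v0:.3f}")
    print(f"with baseline: mean {m1:.3f}, variance {v1:.3f}")
```

Under these assumptions the two estimators should agree in mean up to sampling noise, while the variance with the baseline should be far smaller. The factored setting generalises this by choosing one baseline per factor and accounting for the covariance between the per-factor estimators.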