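\[
\begin{aligned}
\cdots\,\langle z_i, z_i\rangle\Big] + \Big\langle \mathbb{E}_a\big[\widehat{V}_i z_i\big],\ \mathbb{E}_a\big[\widehat{V}_i z_i\big]\Big\rangle
&= \mathbb{E}_a\Big[\widehat{V}_i^{\,2}\,\langle z_i, z_i\rangle\Big] + \Big\langle \mathbb{E}_{\sigma_{\pi_i}(a)}\big[\widehat{V}_i\,\mathbb{E}_{\sigma_{\pi_i}(a)}[z_i]\big],\ \mathbb{E}_{\sigma_{\pi_i}(a)}\big[\widehat{V}_i\,\mathbb{E}_{\sigma_{\pi_i}(a)}[z_i]\big]\Big\rangle \\
&= \mathbb{E}_a\Big[\cdots
\end{aligned}
\]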
\[
\mathrm{Cov}\big[g_i(s,a),\, g_j(s,a)\big]. \tag{9}
\]

The $n$ optimal baselines are given by the values that minimise Equation 9; i.e. $b_i^\star(s, \sigma_{\pi_i}(a))$.

Note that while $y_i$ depends on the full action, $x_i$ depends only on the actions influencing the targets in $[K_\Sigma \psi(s,a)]_i$.

In general, there are very few methods that can solve these types of systems, and those that can are limited to bounds of approximately $d|\Sigma| \approx 20$.

These explore the impact of the factor baseline across a set of dimensionalities and learning rates.

This implies that the performance observed in the search bandit is likely to tell us about the performance in full MDPs.
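The statement above that the optimal baselines are the values minimising the variance expression can be illustrated with a minimal sketch. It is not the factored baseline of this paper: it uses the standard scalar variance-minimising baseline for a score-function estimator, $b^\star = \mathbb{E}[z^2 R] / \mathbb{E}[z^2]$, on a one-dimensional Gaussian bandit. The policy, the reward function, and the name `baseline_variances` are assumptions introduced only for illustration.

```python
import numpy as np


def baseline_variances(n_samples=200_000, seed=0):
    """Compare score-function gradient variance with and without a baseline.

    Hypothetical toy setup (not from the paper): a unit-variance Gaussian
    policy over a scalar action and a quadratic reward with an offset.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0                       # assumed policy parameters
    a = rng.normal(mu, sigma, size=n_samples)  # sampled actions
    reward = 10.0 - (a - 2.0) ** 2             # assumed reward function
    z = (a - mu) / sigma**2                    # score: d/dmu log pi(a)

    # Scalar baseline minimising the empirical E[z^2 (R - b)^2].
    b_star = np.mean(z**2 * reward) / np.mean(z**2)

    g_none = z * reward                        # estimator without a baseline
    g_opt = z * (reward - b_star)              # estimator with the baseline
    return np.var(g_none), np.var(g_opt), b_star, np.mean(g_none), np.mean(g_opt)


if __name__ == "__main__":
    var_none, var_opt, b_star, mean_none, mean_opt = baseline_variances()
    print(f"baseline: {b_star:.3f}")
    print(f"variance without baseline: {var_none:.3f}")
    print(f"variance with baseline:    {var_opt:.3f}")
    print(f"gradient estimates: {mean_none:.3f} vs {mean_opt:.3f}")
```

Because the score has zero mean under the policy, subtracting a baseline leaves the gradient estimate unchanged in expectation and only alters its variance. A factored version would, under the same reasoning, fit one such baseline per factor rather than a single scalar.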