a766f56d2da42cae20b5652970ec04ef-Supplemental-Conference.pdf

Neural Information Processing Systems 

The critic is learned by regressing theλ-returns via a squared loss, wheresg() indicates stopping the gradientaroundthetargets: L(v) .