7 Appendix A

The conclusion of this lemma is that it suffices to minimize the empirical 0-1 loss under the learner's own one-trajectory empirical state distribution $\frac{1}{N}\sum_{t=1}^{N}\delta_{s_t}$. Then the variable $\pi_t(\cdot \mid s)$ lies in the simplex $\Delta^{|\mathcal{A}|-1}$ and the vector $z^i_t(s)$ is coordinate-wise bounded between $0$ and $1$. To learn the sequence of policies returned by the learner, we use the normalized-EG algorithm of [28], which is also known as Follow-the-Regularized-Leader / Online Mirror Descent with entropy regularization for online learning; a sketch of the update appears below. Formally, the online learning problem and the algorithm are as defined in Section 2 of [28].

Theorem 8 (Adapted from Theorem 2.22 in [28]).

At each state $s \in \mathcal{S}$, choosing the expert's action renews the state from the initial distribution $\rho$ and provides a reward of $1$ (except at state $b$, where it provides a reward of $0$), while every other action deterministically transitions the learner to the bad state and provides no reward; a toy encoding of this construction is also sketched below. At the states unvisited in the dataset $\mathcal{D}$, the learner cannot infer the expert's policy, or even the transitions induced under different actions. Intuitively, the learner cannot guess the expert's action with probability greater than $1/|\mathcal{A}|$ at such states, a statement which we prove by leveraging the Bayesian construction.
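To make the online-learning step concrete, the following is a minimal sketch of the normalized-EG (exponentiated-gradient) update over the simplex, i.e. FTRL / Online Mirror Descent with entropy regularization applied to linear losses $\langle z_t, \pi_t \rangle$ with $z_t \in [0,1]^{|\mathcal{A}|}$. The function name, the array encoding of $z_t(s)$, and the step-size choice are illustrative assumptions, not the exact setup of [28].

```python
import numpy as np

def normalized_eg(loss_vectors, eta):
    """Normalized-EG / FTRL with entropy regularization on the simplex.

    loss_vectors: sequence of length-A arrays with entries in [0, 1],
        playing the role of the vectors z_t(s) in the text (assumption:
        one loss vector is revealed per round).
    eta: step size; eta of order sqrt(log(A) / T) is the usual choice
        for losses bounded in [0, 1].
    Returns the sequence of simplex iterates pi_1, ..., pi_T.
    """
    iterates = []
    cumulative = None
    for z in loss_vectors:
        z = np.asarray(z, dtype=float)
        if cumulative is None:
            cumulative = np.zeros_like(z)  # first iterate is uniform
        # The iterate is the entropy-regularized leader: a softmax of the
        # negated cumulative loss (max-subtraction for numerical stability).
        logits = -eta * cumulative
        w = np.exp(logits - logits.max())
        iterates.append(w / w.sum())
        cumulative += z  # accumulate the loss observed this round
    return iterates
```

For instance, `normalized_eg(np.random.rand(100, 4), eta=np.sqrt(np.log(4) / 100))` produces 100 simplex iterates starting from the uniform distribution over 4 actions.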
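The lower-bound instance described above can likewise be sketched as a toy simulator. The integer state encoding, the `expert_action` array, and the designated indices `b` and `bad_state` are hypothetical choices made purely for illustration; the paper's formal construction is Bayesian over the expert's actions at unvisited states.

```python
import numpy as np

def step(state, action, expert_action, rho, b, bad_state, rng):
    """One transition of the (illustrative) hard instance.

    expert_action: array mapping each state to the expert's action
        (hypothetical encoding); rho: initial state distribution over
        the non-bad states (bad_state is assumed outside rho's support).
    """
    if state == bad_state:
        return bad_state, 0.0  # the bad state is absorbing and rewardless
    if action == expert_action[state]:
        # The expert's action pays reward 1 (0 at the special state b)
        # and renews the state from the initial distribution rho.
        reward = 0.0 if state == b else 1.0
        next_state = rng.choice(len(rho), p=rho)
        return next_state, reward
    # Any other action falls deterministically into the bad state.
    return bad_state, 0.0
```

At any state not covered by the dataset, `expert_action[state]` is unknown to the learner, so a guess succeeds with probability only $1/|\mathcal{A}|$, matching the intuition above.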
