7 Appendix A

Neural Information Processing Systems 

's vector (where the dimension is inferred from context). Recall from eq. (2) that the population expected 0-1 loss of a policy π is defined as L( f N … (19)

= min … µ, µ|S| H N …, (20)

where the last inequality uses [22, Theorem 6.1]. This concludes the proof of Theorem 2.

7.3 Proof of Theorem 3

Theorem 10. …

In particular, the learner exactly knows the expert's policy … The expert policy is deterministic in the lower-bound instances we construct. First, the expert's policy is sampled uniformly from … Intuitively, the learner cannot guess the expert's action … Since the bad state is never observed in the dataset, the learner is forced to guess the expert's action. Using [22, Lemma A.14], the conditional distribution of the expert's policy given the expert dataset D can be characterized.
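The sketch below gives a rough numerical illustration of the guessing intuition in the proof of Theorem 3. It uses a toy model that is an assumption and not the paper's construction. States are drawn i.i.d. from a fixed distribution `state_probs` rather than along trajectories of horizon H. The expert's deterministic action at each state is sampled uniformly from |A| actions. The learner copies the expert on observed states and makes a fixed guess elsewhere. Under these assumptions the expected 0-1 loss is Σ_s p(s)(1 − p(s))^N (1 − 1/|A|). All function names are hypothetical.

```python
import math
import random
from fractions import Fraction


def guess_error_probability(num_actions: int, guess: int) -> Fraction:
    """Exact chance that a fixed guess differs from an expert action drawn
    uniformly at random from {0, ..., num_actions - 1}."""
    if not 0 <= guess < num_actions:
        raise ValueError("guess must be a valid action index")
    mismatches = sum(1 for a in range(num_actions) if a != guess)
    return Fraction(mismatches, num_actions)


def expected_unseen_error(state_probs, num_samples: int, num_actions: int) -> float:
    """Closed form for the toy model (hypothetical simplification):
    sum_s p(s) * (1 - p(s))**N * (1 - 1/|A|)."""
    if not math.isclose(sum(state_probs), 1.0):
        raise ValueError("state_probs must sum to 1")
    miss = 1.0 - 1.0 / num_actions  # chance a blind guess is wrong
    return sum(p * (1.0 - p) ** num_samples * miss for p in state_probs)


def simulate_unseen_error(state_probs, num_samples: int, num_actions: int,
                          trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity under the same toy model."""
    rng = random.Random(seed)
    states = range(len(state_probs))
    total = 0.0
    for _ in range(trials):
        # Expert: a deterministic policy sampled uniformly (one action per state).
        expert = [rng.randrange(num_actions) for _ in states]
        # Dataset: N i.i.d. states (simplifying assumption), labeled by the expert.
        observed = set(rng.choices(states, weights=state_probs, k=num_samples))
        # Learner: copy the expert on observed states, guess action 0 elsewhere.
        learner = [expert[s] if s in observed else 0 for s in states]
        # 0-1 loss of the learner under the toy state distribution.
        total += sum(p for s, p in zip(states, state_probs) if learner[s] != expert[s])
    return total / trials


if __name__ == "__main__":
    print(guess_error_probability(4, guess=0))  # 3/4
    probs = [0.2, 0.3, 0.5]
    print(expected_unseen_error(probs, num_samples=5, num_actions=3))
    print(simulate_unseen_error(probs, num_samples=5, num_actions=3))
```

Any fixed guess has the same mismatch probability 1 − 1/|A| against a uniformly sampled expert action, so using action 0 at unobserved states is an arbitrary choice. The Monte Carlo estimate should match the closed form up to sampling noise.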