7 Appendix A

The conclusion of this lemma is that it suffices to minimize the empirical 0-1 loss under the learner's own one-trajectory empirical state distribution $\frac{1}{N}\sum_{t=1}^{N}\delta_{s_t}$. Then the variable $\pi_t(\cdot \mid s)$ lies in the simplex $\Delta^{|\mathcal{A}|-1}$ and the vector $z^i_t(s)$ is coordinate-wise bounded between $0$ and $1$. To learn the sequence of policies returned by the learner, we use the normalized-EG algorithm of [28], which is also known as Follow-the-Regularized-Leader / Online Mirror Descent with entropy regularization for online learning; a sketch of the update appears below. Formally, the online learning problem and the algorithm are as defined in Section 2 of [28].

Theorem 8 (Adapted from Theorem 2.22 in [28]).

At each state $s \in \mathcal{S}$, choosing the expert's action renews the state from the initial distribution $\rho$ and provides a reward of $1$ (except at state $b$, where it provides a reward of $0$), while every other action deterministically transitions the learner to the bad state and provides no reward; a toy encoding of this construction is also sketched below. At the states unvisited in the dataset $\mathcal{D}$, the learner cannot infer the expert's policy, or even the transitions induced under different actions. Intuitively, the learner cannot guess the expert's action with probability greater than $1/|\mathcal{A}|$ at such states, a statement which we prove by leveraging the Bayesian construction.
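To make the online-learning step concrete, the following is a minimal sketch of the normalized-EG (exponentiated-gradient) update over the simplex, i.e. FTRL / Online Mirror Descent with entropy regularization applied to linear losses $\langle z_t, \pi_t \rangle$ with $z_t \in [0,1]^{|\mathcal{A}|}$. The function name, the array encoding of $z_t(s)$, and the step-size choice are illustrative assumptions, not the exact setup of [28].

```python
import numpy as np

def normalized_eg(loss_vectors, eta):
    """Normalized-EG / FTRL with entropy regularization on the simplex.

    loss_vectors: sequence of length-A arrays with entries in [0, 1],
        playing the role of the vectors z_t(s) in the text (assumption:
        one loss vector is revealed per round).
    eta: step size; eta of order sqrt(log(A) / T) is the usual choice
        for losses bounded in [0, 1].
    Returns the sequence of simplex iterates pi_1, ..., pi_T.
    """
    iterates = []
    cumulative = None
    for z in loss_vectors:
        z = np.asarray(z, dtype=float)
        if cumulative is None:
            cumulative = np.zeros_like(z)  # first iterate is uniform
        # The iterate is the entropy-regularized leader: a softmax of the
        # negated cumulative loss (max-subtraction for numerical stability).
        logits = -eta * cumulative
        w = np.exp(logits - logits.max())
        iterates.append(w / w.sum())
        cumulative += z  # accumulate the loss observed this round
    return iterates
```

For instance, `normalized_eg(np.random.rand(100, 4), eta=np.sqrt(np.log(4) / 100))` produces 100 simplex iterates starting from the uniform distribution over 4 actions.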
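The lower-bound instance described above can likewise be sketched as a toy simulator. The integer state encoding, the `expert_action` array, and the designated indices `b` and `bad_state` are hypothetical choices made purely for illustration; the paper's formal construction is Bayesian over the expert's actions at unvisited states.

```python
import numpy as np

def step(state, action, expert_action, rho, b, bad_state, rng):
    """One transition of the (illustrative) hard instance.

    expert_action: array mapping each state to the expert's action
        (hypothetical encoding); rho: initial state distribution over
        the non-bad states (bad_state is assumed outside rho's support).
    """
    if state == bad_state:
        return bad_state, 0.0  # the bad state is absorbing and rewardless
    if action == expert_action[state]:
        # The expert's action pays reward 1 (0 at the special state b)
        # and renews the state from the initial distribution rho.
        reward = 0.0 if state == b else 1.0
        next_state = rng.choice(len(rho), p=rho)
        return next_state, reward
    # Any other action falls deterministically into the bad state.
    return bad_state, 0.0
```

At any state not covered by the dataset, `expert_action[state]` is unknown to the learner, so a guess succeeds with probability only $1/|\mathcal{A}|$, matching the intuition above.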
