0e915db6326b6fb6a3c56546980a8c93-Supplemental.pdf

Neural Information Processing Systems 

Let B be the maximum difference between U1t and U2t, and let (π, θ1, θ2) be a Nash Equilibrium for G. Let π1 be the best response to the first teacher (with utility U1t) and let π1+2 be the best response policy to the joint teacher. This result shows that as we reduce the number of random episodes, the approximation to a minimax regret strategy improves. Let G be the dual curriculum game in which the first teacher maximizes regret, so U1t = URt, and the second teacher plays randomly, so U2t = UUt. Finally, we need to show that π2+3 is optimal for the student.
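The claim that fewer random episodes bring the student closer to a minimax regret strategy can be illustrated with a small toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: a finite table `returns[i, j]` of returns for student policy `i` on level `j`, a student restricted to pure policies, and a joint teacher modelled as playing uniformly at random with probability `p_random` and as a regret maximizer otherwise. The helper names are hypothetical.

```python
import numpy as np


def regret_matrix(returns):
    """Regret of policy i on level j: best return on level j minus returns[i, j]."""
    returns = np.asarray(returns, dtype=float)
    return returns.max(axis=0, keepdims=True) - returns


def minimax_regret_policy(returns):
    """Index of the pure policy with the smallest worst-case regret."""
    return int(np.argmin(regret_matrix(returns).max(axis=1)))


def joint_teacher_regret(returns, p_random):
    """Expected regret per policy against the toy joint teacher.

    With probability p_random the level is drawn uniformly at random;
    otherwise the level maximizing that policy's regret is chosen.
    """
    reg = regret_matrix(returns)
    return (1.0 - p_random) * reg.max(axis=1) + p_random * reg.mean(axis=1)


def best_response(returns, p_random):
    """Pure policy minimizing expected regret against the toy joint teacher."""
    return int(np.argmin(joint_teacher_regret(returns, p_random)))


# Hypothetical returns: policy 0 excels on two levels but fails the third,
# policy 1 is more even across levels.
R = [[1.0, 1.0, 0.0],
     [0.5, 0.5, 0.7]]

for p in (0.9, 0.5, 0.1):
    print(p, best_response(R, p), minimax_regret_policy(R))
```

In this toy table, the best response under a mostly random teacher differs from the minimax regret policy, and the two coincide once the share of random episodes is small enough. This only mirrors the qualitative statement above; it does not reproduce the bound in terms of B.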
