0e915db6326b6fb6a3c56546980a8c93-Supplemental.pdf
Neural Information Processing Systems
Let $B$ be the maximum difference between $U_t^1$ and $U_t^2$, and let $(\pi, \theta_1, \theta_2)$ be a Nash equilibrium for $G$. Let $\pi_1$ be the best response to the first teacher (with utility $U_t^1$) and let $\pi_{1+2}$ be the best-response policy to the joint teacher. This result shows that as we reduce the number of random episodes, the approximation to a minimax regret strategy improves. Let $G$ be the dual curriculum game in which the first teacher maximizes regret, so $U_t^1 = U_t^R$, and the second teacher plays randomly, so $U_t^2 = U_t^U$. Finally, we need to show that $\pi_{2+3}$ is optimal for the student.
Feb-7-2026, 12:16:50 GMT
- Country:
  - Asia
  - Europe
  - North America
    - Mexico (0.05)
    - United States (0.05)
  - Oceania > Australia (0.05)
  - South America > Brazil (0.05)
- Genre:
  - Research Report > New Finding (0.48)
- Industry:
  - Leisure & Entertainment > Sports > Motorsports > Formula One (0.46)