Review for NeurIPS paper: Near-Optimal Reinforcement Learning with Self-Play
Neural Information Processing Systems
Additional Feedback:
*) Is there a reason to present Algorithm 1? Algorithm 2 appears to give improved performance relative to it; if so, why present both algorithms rather than Algorithm 2 alone?
*) Although equation 9 can be thought of as a set of nm linear constraints, why is the optimization problem always feasible? Although the authors devote half a page to this procedure, I feel it is not well explained: most of the discussion is not devoted to the policy certification procedure itself. Why, for a fixed \mu, is the best response not Markovian?
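To make the feasibility question concrete: if equation 9 encodes coarse-correlated-equilibrium-style constraints over the nm joint actions (my reading of the certification procedure, which the authors should confirm), then the program is feasible because any Nash equilibrium induces a joint distribution satisfying all constraints. A minimal numerical sketch on matching pennies, where the uniform joint distribution (the product of the Nash marginals) satisfies every no-unilateral-deviation constraint:

```python
import numpy as np

# Matching pennies: row player's payoffs; the column player gets the negative.
U1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
U2 = -U1

# Joint distribution over the n x m action pairs; uniform is the product of
# the unique Nash equilibrium marginals in this game.
P = np.full((2, 2), 0.25)

def cce_violation(P, U1, U2):
    """Largest violation of the CCE-style constraints: no player should
    gain by unilaterally deviating to a fixed action."""
    ev1 = np.sum(P * U1)          # row player's expected payoff under P
    ev2 = np.sum(P * U2)          # column player's expected payoff under P
    col_marg = P.sum(axis=0)      # marginal over the column player's actions
    row_marg = P.sum(axis=1)      # marginal over the row player's actions
    v = 0.0
    for a in range(U1.shape[0]):  # row player deviates to row a
        v = max(v, U1[a] @ col_marg - ev1)
    for b in range(U1.shape[1]):  # column player deviates to column b
        v = max(v, row_marg @ U2[:, b] - ev2)
    return v

max_violation = cce_violation(P, U1, U2)
print(max_violation)  # 0.0: every constraint holds, so the set is nonempty
```

This is only an illustration of why such constraint sets are always nonempty in matrix games; whether it answers the feasibility question for equation 9 as stated depends on the exact form of the constraints in the paper.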
Jan-22-2025, 01:41:51 GMT