Reviews: Surrogate Objectives for Batch Policy Optimization in One-step Decision Making
–Neural Information Processing Systems
Summary: The main points of the paper are:
-- The expected reward objective has exponentially many local maxima.
-- A smoothed risk, and hence the new loss L(q, r, x), can be used instead; both are calibrated, and L is strongly convex, implying a unique global optimum.

Originality: The work is original.

Clarity: The paper is clear to read, except for some details in the experimental section on page 4, where the meaning of the risk R(\pi) is not described clearly.

Significance and comments: First, regarding the new objective for contextual bandits, the authors note that it is not the same as the trust-region or proximal objectives used in RL (line 237), but how does it compare with the maximum-entropy RL objectives (e.g., Haarnoja et al., Soft Q-Learning and Soft Actor-Critic) under the same policy and value-function/reward models? In these max-ent RL formulations, an estimator similar to Eqn 12 on page 5 is optimized.
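For concreteness, the maximum-entropy objective I am referring to is the standard one from Haarnoja et al. (here \alpha is the entropy temperature and \rho_\pi the state-action distribution induced by \pi):

    J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid s_t)) \right].

In the one-step (contextual bandit) setting this reduces to expected reward plus an entropy bonus on \pi(\cdot \mid x), which is the form I believe is comparable to the estimator in Eqn 12.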