Review for NeurIPS paper: Conservative Q-Learning for Offline Reinforcement Learning
–Neural Information Processing Systems
The theoretical claims of producing lower bounds on the Q-values are not sufficient, since there is no proof that the conservative Q-values are anywhere near the true Q-values. Simply estimating Q = 0 for positive rewards could give the same result as Theorems 3.1, 3.2, and 3.3. The algorithm is clearly doing something smarter than this, but the current analysis does not characterize what it is doing. The gap-expanding result is likely the strongest of the four theorems, but without work connecting it back to why it will actually help performance, it is still difficult to judge. Moreover, no comparison is made to the overestimation that would occur without the proposed algorithmic change.
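To make the triviality concern concrete, here is a minimal sketch, assuming a made-up two-state, two-action MDP with positive rewards (none of these numbers come from the paper): the zero function Q_hat = 0 already satisfies a lower-bound property with respect to the true optimal Q, so a lower-bound guarantee alone says nothing about tightness.

```python
import numpy as np

# Hypothetical MDP: 2 states, 2 actions, strictly positive rewards.
# P[s, a, s'] gives transition probabilities; R[s, a] gives rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.5],
              [0.3, 2.0]])
gamma = 0.9

# Value iteration to (approximately) the true optimal Q.
Q = np.zeros((2, 2))
for _ in range(1000):
    Q = R + gamma * P @ Q.max(axis=1)

# The trivial "conservative" estimate: zero everywhere.
Q_hat = np.zeros_like(Q)

# With positive rewards, Q_hat = 0 lower-bounds the true Q
# at every state-action pair -- the bound is valid but vacuous.
assert np.all(Q_hat <= Q)
print(Q)  # true Q-values are strictly positive
```

This is of course not what CQL computes; it only illustrates why a lower-bound theorem, absent a gap or tightness guarantee, does not by itself distinguish the algorithm from a degenerate estimator.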