Review for NeurIPS paper: Multi-task Batch Reinforcement Learning with Metric Learning


Weaknesses: The main weakness of the method is its reliance on accurate relabelling. The paper argues that actor-critic networks become causally confused because the task distributions are (almost) disjoint, and then hopes that reward models will not suffer from the same problem. However, the problem appears to affect reward models as well, since a reward ensemble is used in the experiments. There is no ablation study investigating whether this ensemble is actually necessary in the offline setting. Can you explain why you did not use the settings from Sections 5.1 and 5.2 to evaluate this component of your model?