Off-Policy Reinforcement Learning with High Dimensional Reward