Personalizing Task-oriented Dialog Systems via Zero-shot Generalizable Reward Function