Guided Dialog Policy Learning without Adversarial Learning in the Loop