Adaptive Dialog Policy Learning with Hindsight and User Modeling