Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation