Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs