Our Evaluation Metric Needs an Update to Encourage Generalization