On the Blind Spots of Model-Based Evaluation Metrics for Text Generation