MENLI: Robust Evaluation Metrics from Natural Language Inference