Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics