Faithful Model Evaluation for Model-Based Metrics