Multi-Modal Language Models as Text-to-Image Model Evaluators