Adapting Vision-Language Models for Evaluating World Models