Scalable Policy Evaluation with Video World Models