Scalable Performance Analysis for Vision-Language Models