CAST: Cross-modal Alignment Similarity Test for Vision Language Models