Trust but Verify: Programmatic VLM Evaluation in the Wild