Trust but Verify: Programmatic VLM Evaluation in the Wild
