Automatic benchmarking of large multimodal models via iterative experiment programming