Automatic benchmarking of large multimodal models via iterative experiment programming
