Can We Predict Performance of Large Models across Vision-Language Tasks?

Open in new window