Can We Predict Performance of Large Models across Vision-Language Tasks?