Beyond Static Models and Test Sets: Benchmarking the Potential of Pre-trained Models Across Tasks and Languages