Estimating Large Language Model Capabilities without Labeled Test Data