Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation