A Scalable Framework for Evaluating Health Language Models