Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench
