Measuring what Matters: Construct Validity in Large Language Model Benchmarks
