What Does Your Benchmark Really Measure? A Framework for Robust Inference of AI Capabilities
