What Does Your Benchmark Really Measure? A Framework for Robust Inference of AI Capabilities