Benchmarking is Broken -- Don't Let AI be its Own Judge