Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data