With AI models clobbering every benchmark, it's time for human evaluation
