When LLMs get significantly worse: A statistical approach to detect model degradations
