When LLMs get significantly worse: A statistical approach to detect model degradations