How OpenAI stress-tests its large language models