Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs
