Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs