Small Edits, Big Consequences: Telling Good from Bad Robustness in Large Language Models