Do Large Language Models Judge Error Severity Like Humans?