Metaphor and Large Language Models: When Surface Features Matter More than Deep Understanding