Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications