On the Soundness and Consistency of LLM Agents for Executing Test Cases Written in Natural Language