A Systematic Approach for Assessing Large Language Models' Test Case Generation Capability