Evaluating Large Language Models Using Contrast Sets: An Experimental Approach