Testing for LLM response differences: the case of a composite null consisting of semantically irrelevant query perturbations