Testing for LLM response differences: the case of a composite null consisting of semantically irrelevant query perturbations
