Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies