The impact of internal variability on benchmarking deep learning climate emulators