Measuring the Robustness of Reference-Free Dialogue Evaluation Systems