COLING 2022 Highlights


Recent metrics for natural language generation, such as BERTScore, BLEURT, and COMET, rely on pre-trained language models. These metrics achieve high correlation with human judgments on standard benchmarks. However, it is unclear how they perform on styles and domains that are not well represented in their training data. In other words, are these metrics robust? The authors found that BERTScore is not robust to character-level perturbations.
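To make the robustness test concrete, here is a minimal sketch of the kind of character-level perturbation such studies apply: randomly swapping adjacent characters to simulate typos, then re-scoring the perturbed candidate against the reference. The `perturb_chars` helper is hypothetical (not the paper's code), and the commented-out scoring call assumes the `bert-score` package.

```python
import random

def perturb_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent characters with probability `rate` to simulate typos.

    Hypothetical helper for illustration; a deterministic seed keeps
    the perturbation reproducible across runs.
    """
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

reference = "The cat sat on the mat."
candidate = perturb_chars(reference, rate=0.2)
print(candidate)  # a lightly misspelled copy of the reference

# A robust metric should score `candidate` close to a clean copy of
# `reference`; the finding above suggests BERTScore does not. Scoring
# would look roughly like this (requires the bert-score package):
# from bert_score import score
# P, R, F1 = score([candidate], [reference], lang="en")
```

A human reader still recovers the meaning of the perturbed sentence easily, which is why a large score drop under such perturbations counts as a robustness failure rather than a genuine quality difference.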