Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices

Schmidtová, Patrícia; Mahamood, Saad; Balloccu, Simone; Dušek, Ondřej; Gatt, Albert; Gkatzia, Dimitra; Howcroft, David M.; Plátek, Ondřej; Sivaprasad, Adarsa

arXiv.org Artificial Intelligence 

There is now a significant body of contributions presenting experimental research, meta-analyses and/or best practice guidelines, on issues ranging from statistical significance testing (Dror and Reichart, 2018), to human evaluation methods (Howcroft et al., 2020a; van der Lee et al., 2021; Hämäläinen and Alnajjar, 2021; Shimorina and Belz, 2022a), error analysis (van Miltenburg et al., 2021a, 2023) and replicability.

Given the well-documented shortcomings of automatic metrics, our goal in this paper is to survey the current state of play in metric-based evaluations of natural language generation (NLG). As with the above-mentioned studies focusing on other facets of evaluation, we aim both to understand how metrics are currently used in NLG and to identify gaps and possible ways forward in an effort to improve the scientific quality of NLG research.
