Empowering Cross-lingual Behavioral Testing of NLP Models with Typological Features

Hlavnova, Ester, Ruder, Sebastian

arXiv.org Artificial Intelligence 

We use M2C to generate tests that probe models' behavior in light of specific linguistic features in 12 typologically diverse languages. We evaluate state-of-the-art language models on the generated tests. While models excel at most tests in English, we highlight generalization failures to specific typological characteristics such as temporal expressions in Swahili and compounding possessives in Finnish. Our findings motivate the development of models that address these blind spots.

Figure 1: Top: Comparison of state-of-the-art models on M2C tests in a selected set of languages. Models perform well on English but poorly on certain tests in other languages. Bottom: Even the largest models fail on tests probing language-specific features, e.g., the distinction ...
