Empowering Cross-lingual Behavioral Testing of NLP Models with Typological Features
Ester Hlavnova, Sebastian Ruder
arXiv.org Artificial Intelligence
We use M2C to generate tests that probe models' behavior in light of specific linguistic features in 12 typologically diverse languages. We evaluate state-of-the-art language models on the generated tests. While models excel at most tests in English, we highlight generalization failures to specific typological characteristics such as temporal expressions in Swahili and compounding possessives in Finnish. Our findings motivate the development of models that address these blind spots.

Figure 1: Top: Comparison of state-of-the-art models on M2C tests in a selected set of languages. Models perform well on English but poorly on certain tests in other languages. Bottom: Even the largest models fail on tests probing language-specific features, e.g., the distinction ...
Jul-11-2023
- Country:
  - North America
    - Dominican Republic (0.04)
    - United States
      - New Mexico > Santa Fe County
        - Santa Fe (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
  - Europe
  - Asia > Middle East
    - UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- Genre:
  - Research Report > New Finding (0.34)
- Technology: