Does mBERT understand Romansh? Evaluating word embeddings using word alignment
arXiv.org Artificial Intelligence
We test similarity-based word alignment models (SimAlign and awesome-align) in combination with word embeddings from mBERT and XLM-R on parallel sentences in German and Romansh. Since Romansh is an unseen language, we are dealing with a zero-shot setting. Using embeddings from mBERT, both models reach an alignment error rate of 0.22, which outperforms fast_align, a statistical model, and is on par with similarity-based word alignment for seen languages. We interpret these results as evidence that mBERT contains information that can be meaningful and applicable to Romansh. To evaluate performance, we also present a new trilingual corpus, which we call the DERMIT (DE-RM-IT) corpus, containing press releases published by the Canton of Grisons in German, Romansh and Italian over the past 25 years. The corpus contains 4 547 parallel documents and approximately 100 000 sentence pairs in each language combination. We additionally present a gold standard for German-Romansh word alignment. The data is available at https://github.com/eyldlv/DERMIT-Corpus.
Aug-17-2023
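The alignment error rate (AER) quoted in the abstract is commonly computed with the definition of Och and Ney (2003), which compares predicted links against "sure" and "possible" gold links. The sketch below is a minimal illustration of that standard formula, not the authors' evaluation script. The function name and the toy links are hypothetical, and it is an assumption here that the gold standard distinguishes sure from possible links.

```python
def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), after Och and Ney (2003).

    Each argument is an iterable of (source_index, target_index) pairs.
    Possible links are taken to include the sure links.
    """
    predicted = set(predicted)
    sure = set(sure)
    possible = set(possible) | sure
    if not predicted and not sure:
        return 0.0  # nothing to align and nothing predicted
    hits_sure = len(predicted & sure)
    hits_possible = len(predicted & possible)
    return 1.0 - (hits_sure + hits_possible) / (len(predicted) + len(sure))


# Toy example with invented links (not taken from the DERMIT gold standard).
gold_sure = {(0, 0), (1, 1), (2, 2)}
gold_possible = {(3, 2)}
system_links = {(0, 0), (1, 1), (3, 2), (4, 4)}

aer = alignment_error_rate(system_links, gold_sure, gold_possible)
print(f"AER = {aer:.3f}")
```

Lower is better: a system that predicts exactly the sure links scores 0, so a value of 0.22 means that roughly a fifth of the link mass is wrong or missing under this measure.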