Multi-label Scandinavian Language Identification (SLIDE)

Fedorova, Mariia; Frydenberg, Jonas Sebulon; Handford, Victoria; Langø, Victoria Ovedie Chruickshank; Willoch, Solveig Helene; Midtgaard, Marthe Løken; Scherrer, Yves; Mæhlum, Petter; Samuel, David

arXiv.org Artificial Intelligence 

Identifying closely related languages at the sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present Scandinavian Language Identification and Evaluation (SLIDE): a manually curated multi-label evaluation dataset and a suite of LID models with varying speed-accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.
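
To illustrate why multi-label output matters when scoring, the following is a minimal sketch, not the evaluation code of the paper. The language codes, the toy gold and predicted label sets, and the function names (exact_match, micro_f1) are all assumptions made for illustration. It compares a hypothetical system that emits one language per sentence with one that emits a set of languages, scored against gold label sets.

```python
# Hypothetical label inventory: Danish, Bokmål, Nynorsk, Swedish.
LANGS = ("da", "nb", "nn", "sv")

def exact_match(gold, pred):
    """Fraction of sentences whose predicted label set equals the gold set."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def micro_f1(gold, pred):
    """Micro-averaged F1 over all (sentence, language) decisions."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels found correctly
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # labels predicted in error
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # gold labels missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy gold annotations (invented): two of three sentences are valid in
# more than one language.
gold = [{"da", "nb"}, {"sv"}, {"nb", "nn"}]

# A single-label system can return at most one language per sentence.
pred_single = [{"da"}, {"sv"}, {"nn"}]
# A multi-label system may return several.
pred_multi = [{"da", "nb"}, {"sv"}, {"nb", "nn"}]

single_em = exact_match(gold, pred_single)
multi_em = exact_match(gold, pred_multi)
single_f1 = micro_f1(gold, pred_single)
multi_f1 = micro_f1(gold, pred_multi)
print(single_em, single_f1, multi_em, multi_f1)
```

In this toy setting the single-label system never predicts a wrong language, yet it cannot reach full recall because it misses the second valid language of every ambiguous sentence.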