Multi-label Scandinavian Language Identification (SLIDE)

Fedorova, Mariia, Frydenberg, Jonas Sebulon, Handford, Victoria, Langø, Victoria Ovedie Chruickshank, Willoch, Solveig Helene, Midtgaard, Marthe Løken, Scherrer, Yves, Mæhlum, Petter, Samuel, David

Feb-10-2025, arXiv.org Artificial Intelligence

Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed-accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.
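To illustrate why single-label prediction falls short on such sentences, here is a minimal sketch of multi-label decoding and exact-match scoring. It is not the SLIDE models or the authors' training approach: the language codes, the 0.5 threshold, the helper names, and the per-language scores are all assumptions made up for the example.

```python
# Illustrative only: hypothetical helpers and made-up scores, not the SLIDE models.
LANGS = ("da", "nb", "nn", "sv")  # Danish, Bokmål, Nynorsk, Swedish


def predict_multilabel(scores, threshold=0.5):
    """Return every language whose independent score reaches the threshold."""
    labels = {lang for lang, score in scores.items() if score >= threshold}
    if not labels:
        # Fall back to the best-scoring language so the prediction is never empty.
        labels = {max(scores, key=scores.get)}
    return labels


def predict_single(scores):
    """Single-label baseline: keep only the highest-scoring language."""
    return {max(scores, key=scores.get)}


def exact_match(predicted, gold):
    """A prediction counts as correct only if the full label set matches."""
    return predicted == gold


# Assumed scores for a sentence that is valid in both Danish and Bokmål.
scores = {"da": 0.91, "nb": 0.88, "nn": 0.12, "sv": 0.03}
gold = {"da", "nb"}

multi = predict_multilabel(scores)
single = predict_single(scores)

print(sorted(multi), exact_match(multi, gold))
print(sorted(single), exact_match(single, gold))
```

Under these assumed scores, the thresholded prediction recovers both gold labels, while the argmax baseline can return at most one language and so cannot match a two-language gold set, however good its scores are.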

large language model, machine learning, natural language



Country:
- Europe (1.00)
- North America > United States
  - Minnesota > Hennepin County > Minneapolis (0.14)

Genre:
- Research Report > Promising Solution (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning
    - Neural Networks (0.68)
    - Statistical Learning (0.68)
  - Natural Language > Large Language Model (0.47)