ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

Foroutan, Negar, Saydaliev, Jakhongir, Kim, Ye Eun, Bosselut, Antoine

Jun-19-2025–arXiv.org Artificial Intelligence

Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To resolve these class imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. Through an extensive analysis, we show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2%, demonstrating its effectiveness in enhancing LID models.
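The abstract does not spell out the loss, so the following is only a minimal sketch of a generic supervised contrastive objective, not the authors' implementation. It assumes each example has an embedding vector and a language label, and that examples sharing a label count as positives regardless of their source domain. The function name `supcon_loss`, the `temperature` default, and the toy vectors are all hypothetical.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch (illustrative only)."""
    # L2-normalise each embedding so dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives: other examples with the same (language) label.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Temperature-scaled similarities to every other example in the batch.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability of the positives for this anchor.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two tight clusters of 2-d embeddings (made-up values).
toy = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
aligned = supcon_loss(toy, ["a", "a", "b", "b"])   # labels match the clusters
crossed = supcon_loss(toy, ["a", "b", "a", "b"])   # labels cut across clusters
```

In this toy batch the loss is lower when same-label examples already sit close together (`aligned`) than when they are far apart (`crossed`), which is the pressure such an objective would apply to pull same-language text from different domains toward a shared representation.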

computational linguistics, large language model, machine learning


Country:
- North America > United States (0.46)
- Asia > Middle East (0.28)

Genre:
- Research Report > New Finding (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.66)
  - Machine Learning
    - Performance Analysis > Accuracy (0.68)
    - Neural Networks (0.68)
    - Inductive Learning (0.46)
