TurkEmbed: Turkish Embedding Model on NLI & STS Tasks

Ezerceli, Özay, Gümüşçekiçci, Gizem, Erkoç, Tuğba, Özenç, Berke

Nov-12-2025–arXiv.org Artificial Intelligence

This paper introduces TurkEmbed, a novel Turkish language embedding model designed to outperform existing models, particularly in Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. Current Turkish embedding models often rely on machine-translated datasets, potentially limiting their accuracy and semantic understanding. TurkEmbed utilizes a combination of diverse datasets and advanced training techniques, including matryoshka representation learning, to achieve more robust and accurate embeddings. This approach enables the model to adapt to various resource-constrained environments, offering faster encoding capabilities. Our evaluation on the Turkish STS-b-TR dataset, using Pearson and Spearman correlation metrics, demonstrates significant improvements in semantic similarity tasks. Furthermore, TurkEmbed surpasses the current state-of-the-art model, Emrecan, on All-NLI-TR and STS-b-TR benchmarks, achieving a 1-4\% improvement. TurkEmbed promises to enhance the Turkish NLP ecosystem by providing a more nuanced understanding of language and facilitating advancements in downstream applications.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

Nov-12-2025

arXiv.org PDF

Add feedback

Country:
- Asia > Middle East > Republic of Türkiye (0.28)

Genre:
- Research Report
  - New Finding (0.46)
  - Promising Solution (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Text Processing (1.00)
  - Machine Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found