Multilingual De-Duplication Strategies: Applying scalable similarity search with monolingual & multilingual embedding models

Pasch, Stefan, Petridis, Dimitirios, Cutura, Jannic

Jun-19-2024–arXiv.org Artificial Intelligence

This paper addresses the deduplication of multilingual textual data using NLP tools. We compare two approaches: a two-step method that translates each text to English and then embeds it with mpnet, and a single-step method that embeds the original text with a multilingual model (distiluse). The two-step approach achieved a higher F1 score (82% vs. 60%), with the advantage most pronounced for less widely used languages. Adding expert rules based on domain knowledge raises the F1 score further, up to 89%. We also highlight limitations related to token length constraints and computational efficiency. Our methodology suggests improvements for future multilingual deduplication tasks.
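
As a rough illustration of the embedding-and-threshold idea the abstract describes, the following minimal Python sketch flags pairs of texts whose embeddings are close in cosine similarity and scores the result with F1. It is not the authors' implementation. The function names, the 0.8 threshold, and the `toy_embed` placeholder (hashed character trigrams) are all assumptions made for illustration; in the setting described above, the embedding would instead come from a model such as mpnet applied to the English translation, and the brute-force pairwise loop would be replaced by a scalable similarity search.

```python
import math
import zlib
from itertools import combinations


def toy_embed(text, dim=256):
    """Hypothetical placeholder embedding: hashed character-trigram counts.

    Stands in for a real sentence-embedding model; it is NOT the model
    used in the paper and captures surface overlap only.
    """
    vec = [0.0] * dim
    cleaned = " ".join(text.lower().split())  # lowercase, collapse whitespace
    for i in range(len(cleaned) - 2):
        # crc32 is deterministic across runs, unlike the built-in hash().
        bucket = zlib.crc32(cleaned[i:i + 3].encode("utf-8")) % dim
        vec[bucket] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate_pairs(texts, embed=toy_embed, threshold=0.8):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold.

    Brute force over all pairs: fine for a demo, quadratic in len(texts).
    The threshold value is an illustrative assumption.
    """
    vectors = [embed(t) for t in texts]
    return {
        (i, j)
        for i, j in combinations(range(len(texts)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    }


def f1_score(predicted, actual):
    """F1 computed over sets of predicted and true duplicate pairs."""
    true_pos = len(predicted & actual)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    docs = [
        "Annual report on bank supervision 2023",
        "annual report on  bank supervision 2023",
        "Weather forecast for the weekend",
    ]
    pairs = find_duplicate_pairs(docs)
    print(sorted(pairs), f1_score(pairs, {(0, 1)}))
```

Passing a different `embed` callable (for example, a wrapper that translates and then calls a sentence-embedding model) leaves the rest of the sketch unchanged, which is the main reason the embedding step is kept as a parameter here.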

dataset, duplicate, multilingual

Country:
- Europe > Germany > Hesse > Darmstadt Region > Frankfurt (0.05)

Genre:
- Research Report (0.82)

Industry:
- Government (0.48)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
