Examining Multilingual Embedding Models Cross-Lingually Through LLM-Generated Adversarial Examples
Michail, Andrianos, Clematide, Simon, Sennrich, Rico
arXiv.org Artificial Intelligence
Evaluation of the cross-lingual semantic search capabilities of models is often limited to existing datasets from tasks such as information retrieval and semantic textual similarity. To allow for domain-specific evaluation, we introduce Cross Lingual Semantic Discrimination (CLSD), a novel cross-lingual semantic search task that requires only a set of parallel sentence pairs in the language pair of interest, drawn from the target domain. The task measures a model's ability to rank the true parallel sentence cross-lingually above hard negatives generated by a large language model. We create four instances of the CLSD task for the language pair German-French in the news domain. In this case study, we find that models that are also fine-tuned for retrieval tasks (e.g., multilingual E5) benefit from using English as the pivot language, while bitext mining models such as LaBSE perform best when used directly cross-lingually. We also present a fine-grained similarity analysis enabled by our distractor generation strategy, which indicates that different embedding models are sensitive to different types of perturbations.
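The ranking criterion described in the abstract can be illustrated with a minimal sketch. It assumes that sentences have already been embedded, that similarity is cosine similarity, and that a model is scored by top-1 accuracy; the abstract does not specify these details, so they are assumptions, and the function names and toy vectors below are hypothetical stand-ins for real model embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def true_target_ranked_first(query_vec, true_vec, distractor_vecs):
    """True if the parallel sentence scores strictly above every distractor."""
    true_score = cosine(query_vec, true_vec)
    return all(true_score > cosine(query_vec, d) for d in distractor_vecs)


def clsd_accuracy(examples):
    """Fraction of examples where the true parallel sentence is ranked first.

    Each example is (query_vec, true_vec, distractor_vecs). Top-1 accuracy
    is an assumed metric, not one confirmed by the abstract.
    """
    hits = sum(true_target_ranked_first(q, t, ds) for q, t, ds in examples)
    return hits / len(examples)


# Toy 2-d vectors standing in for sentence embeddings (hypothetical data).
toy_examples = [
    # Query (e.g. German), true target (e.g. French), hard negatives.
    ([1.0, 0.0], [0.9, 0.1], [[0.5, 0.5], [0.0, 1.0]]),  # true target wins
    ([0.0, 1.0], [0.6, 0.4], [[0.1, 0.9]]),              # a distractor wins
]
accuracy = clsd_accuracy(toy_examples)
print(accuracy)
```

In practice the vectors would come from the embedding model under evaluation, and the same routine could be run on either the direct German-French pairs or on English pivot translations to compare the two settings.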
Feb-12-2025
- Country:
  - Asia
    - China > Hong Kong (0.04)
    - Japan > Honshū
      - Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
    - Middle East > Israel (0.04)
  - Europe
    - Croatia (0.04)
    - Germany > Baden-Württemberg
      - Karlsruhe Region > Karlsruhe (0.04)
      - Stuttgart Region > Stuttgart (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Italy > Tuscany
      - Florence (0.04)
    - Spain > Galicia
      - Madrid (0.04)
    - Switzerland > Zürich
      - Zürich (0.04)
    - United Kingdom > England
      - Greater London > London (0.04)
  - North America
    - Canada
      - British Columbia > Metro Vancouver Regional District
        - Vancouver (0.04)
      - Ontario > Toronto (0.04)
    - United States
      - Florida > Miami-Dade County
        - Miami (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
      - New York (0.04)
- Genre:
  - Research Report (0.64)
- Technology: