Examining Multilingual Embedding Models Cross-Lingually Through LLM-Generated Adversarial Examples

Michail, Andrianos, Clematide, Simon, Sennrich, Rico

Feb-12-2025, arXiv.org Artificial Intelligence

The evaluation of cross-lingual semantic search capabilities of models is often limited to existing datasets from tasks such as information retrieval and semantic textual similarity. To allow for domain-specific evaluation, we introduce Cross Lingual Semantic Discrimination (CLSD), a novel cross-lingual semantic search task that requires only a set of parallel sentence pairs of the language pair of interest within the target domain. This task focuses on the ability of a model to cross-lingually rank the true parallel sentence higher than hard negatives generated by a large language model. We create four instances of our introduced CLSD task for the language pair German-French within the domain of news. Within this case study, we find that models that are also fine-tuned for retrieval tasks (e.g., multilingual E5) benefit from using English as the pivot language, while bitext mining models such as LaBSE perform best directly cross-lingually. We also show a fine-grained similarity analysis enabled by our distractor generation strategy, indicating that different embedding models are sensitive to different types of perturbations.
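The ranking criterion described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that sentences have already been embedded as plain vectors, that candidates are compared by cosine similarity, and that performance is summarized as the share of examples where the true parallel sentence scores highest. The function names (`cosine`, `true_pair_ranked_first`, `clsd_accuracy`) and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def true_pair_ranked_first(query: Vector, candidates: List[Vector], true_index: int) -> bool:
    """Return True if the true parallel sentence scores highest among all candidates.

    `candidates` holds the embedding of the true target-language sentence
    (at position `true_index`) together with the embeddings of its distractors.
    """
    scores = [cosine(query, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best == true_index


def clsd_accuracy(examples: List[Tuple[Vector, List[Vector], int]]) -> float:
    """Share of examples where the true parallel sentence is ranked first (assumed metric)."""
    hits = sum(true_pair_ranked_first(q, cands, t) for q, cands, t in examples)
    return hits / len(examples)


# Toy data (made up): each example is (source embedding, candidate embeddings, true index).
toy_examples = [
    # The true target (index 0) is closest to the source: counted as a hit.
    ([1.0, 0.0, 0.2], [[0.9, 0.1, 0.2], [0.5, 0.5, 0.0], [0.0, 1.0, 0.0]], 0),
    # A distractor (index 1) is closer than the true target: counted as a miss.
    ([0.0, 1.0, 0.0], [[0.6, 0.4, 0.0], [0.1, 0.9, 0.1], [1.0, 0.0, 0.0]], 0),
]

toy_accuracy = clsd_accuracy(toy_examples)
print(toy_accuracy)
```

In practice the vectors would come from an embedding model such as those named in the abstract. Evaluating through an English pivot would additionally require translating the sentences before embedding them, which this sketch leaves out.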

computational linguistics, large language model, natural language

Country:
- North America
  - United States
    - New York (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Florida > Miami-Dade County
      - Miami (0.04)
  - Canada
    - Ontario > Toronto (0.04)
    - British Columbia > Metro Vancouver Regional District
      - Vancouver (0.04)
- Europe
  - Croatia (0.04)
  - United Kingdom > England
    - Greater London > London (0.04)
  - Switzerland > Zürich
    - Zürich (0.04)
  - Spain > Galicia
    - Madrid (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Germany > Baden-Württemberg
    - Stuttgart Region > Stuttgart (0.04)
    - Karlsruhe Region > Karlsruhe (0.04)
- Asia
  - Middle East > Israel (0.04)
  - China > Hong Kong (0.04)
  - Japan > Honshū
    - Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence > Natural Language
  - Text Processing (1.00)
  - Large Language Model (1.00)
