Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval
Yang, Jinrui; Baldwin, Timothy; Cohn, Trevor
arXiv.org Artificial Intelligence
We present Multi-EuP, a new multilingual benchmark dataset comprising 22K multilingual documents collected from the European Parliament, spanning 24 languages. The dataset is designed to investigate fairness in multilingual information retrieval (IR), supporting analysis of both language bias and demographic bias in ranking. It offers an authentic multilingual corpus, with topics translated into all 24 languages, as well as cross-lingual relevance judgments. It also provides rich demographic information associated with its documents, facilitating the study of demographic bias. We report the effectiveness of Multi-EuP for benchmarking both monolingual and multilingual IR. We also conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.
Nov-3-2023
- Country:
  - North America
    - Dominican Republic (0.04)
    - United States > New York
      - New York County > New York City (0.04)
  - Europe
    - Belgium (0.05)
    - Ireland (0.04)
    - Bulgaria (0.04)
    - Poland (0.04)
    - Germany (0.04)
    - Netherlands (0.04)
    - Denmark (0.04)
    - Finland (0.04)
    - Slovakia (0.04)
    - Slovenia (0.04)
    - France (0.04)
    - Italy (0.04)
    - Greece (0.04)
    - Latvia (0.04)
    - Lithuania (0.04)
    - Estonia (0.04)
    - Romania (0.04)
    - Croatia (0.04)
    - Sweden (0.04)
    - Czechia (0.04)
    - United Kingdom (0.04)
    - Hungary (0.04)
    - Spain > Valencian Community
      - Valencia Province > Valencia (0.04)
      - Alicante Province > Alicante (0.04)
    - Portugal > Lisbon
      - Lisbon (0.04)
- Genre:
  - Research Report (1.00)
- Industry:
  - Government > Regional Government (0.46)
- Technology: