Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval
Yang, Jinrui, Baldwin, Timothy, Cohn, Trevor
arXiv.org Artificial Intelligence
We present Multi-EuP, a new multilingual benchmark dataset comprising 22K multilingual documents collected from the European Parliament and spanning 24 languages. The dataset is designed for investigating fairness in multilingual information retrieval (IR), supporting analysis of both language bias and demographic bias in ranking. It offers an authentic multilingual corpus, topics translated into all 24 languages, and cross-lingual relevance judgments. It also provides rich demographic information associated with its documents, facilitating the study of demographic bias. We report the effectiveness of Multi-EuP for benchmarking both monolingual and multilingual IR. We also conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.
Nov-3-2023
- Country:
- Europe > Spain
- Valencian Community (0.14)
- North America > United States (0.28)
- Genre:
- Research Report (1.00)
- Industry:
- Government > Regional Government > Europe Government (1.00)
- Technology: