20min-XD: A Comparable Corpus of Swiss News Articles
Michelle Wastl, Jannis Vamvas, Selena Calleri, Rico Sennrich
arXiv.org Artificial Intelligence
We present 20min-XD (20 Minuten cross-lingual document-level), a French-German, document-level comparable corpus of news articles sourced from the Swiss online news outlet 20 Minuten/20 minutes. Our dataset comprises around 15,000 article pairs spanning 2015 to 2024, automatically aligned based on semantic similarity. We detail the data collection process and alignment methodology. Furthermore, we provide a qualitative and quantitative analysis of the corpus. The resulting dataset exhibits a broad spectrum of cross-lingual similarity, ranging from near-translations to loosely related articles, making it valuable for various NLP applications and broad linguistically motivated studies. We publicly release the dataset in document- and sentence-aligned versions, along with code for the described experiments.
May-1-2025