20min-XD: A Comparable Corpus of Swiss News Articles

Michelle Wastl, Jannis Vamvas, Selena Calleri, Rico Sennrich

May 1, 2025, arXiv.org Artificial Intelligence

We present 20min-XD (20 Minuten cross-lingual document-level), a French-German, document-level comparable corpus of news articles sourced from the Swiss online news outlet 20 Minuten/20 minutes. Our dataset comprises around 15,000 article pairs spanning 2015 to 2024, automatically aligned based on semantic similarity. We detail the data collection process and alignment methodology, and we provide a qualitative and quantitative analysis of the corpus. The resulting dataset exhibits a broad spectrum of cross-lingual similarity, ranging from near-translations to loosely related articles, making it valuable for various NLP applications and a broad range of linguistically motivated studies. We publicly release the dataset in document-aligned and sentence-aligned versions, together with the code for the described experiments.
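
The abstract says article pairs are aligned by semantic similarity but gives no details. As a rough illustration of that general idea only, and not the authors' actual pipeline, the sketch below pairs documents whose embedding vectors are mutual nearest neighbours under cosine similarity. The toy vectors, the `align` helper, the mutual-best-match rule, and the 0.5 threshold are all assumptions made for this example; a real system would obtain the vectors from a multilingual embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(de_vecs, fr_vecs, threshold=0.5):
    """Return (de_id, fr_id, score) for mutual best matches at or above threshold.

    Hypothetical helper: the matching rule and threshold are illustrative
    assumptions, not taken from the paper.
    """
    # Best French candidate for every German article, and vice versa.
    best_fr = {d: max(fr_vecs, key=lambda f: cosine(dv, fr_vecs[f]))
               for d, dv in de_vecs.items()}
    best_de = {f: max(de_vecs, key=lambda d: cosine(de_vecs[d], fv))
               for f, fv in fr_vecs.items()}
    pairs = []
    for d, f in best_fr.items():
        score = cosine(de_vecs[d], fr_vecs[f])
        # Keep a pair only if each side picks the other and the score is high enough.
        if best_de[f] == d and score >= threshold:
            pairs.append((d, f, score))
    return pairs


# Toy embeddings standing in for real multilingual document vectors.
de = {"de1": [1.0, 0.0, 0.2], "de2": [0.0, 1.0, 0.1]}
fr = {"fr1": [0.9, 0.1, 0.2], "fr2": [0.1, 0.8, 0.0], "fr3": [0.0, 0.0, 1.0]}
pairs = align(de, fr)
```

In this toy run `fr3` stays unpaired because no German article selects it. The retained scores are graded rather than binary, which is one way a corpus could span the range from near-translations to loosely related articles.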

Country:
- Europe > Switzerland (0.68)
- Asia > Middle East
  - UAE (0.28)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Media > News (0.86)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Text Processing (1.00)
  - Machine Learning (1.00)
