Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages
Dahan, Noam; Kidron, Omer; Stanovsky, Gabriel
arXiv.org Artificial Intelligence
High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, in which editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process suited to languages with varying levels of linguistic resources. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.
Nov-19-2025
- Country:
- Asia > Middle East
- Israel (0.93)
- Europe (1.00)
- Genre:
- Research Report
- New Finding (0.46)
- Promising Solution (0.34)
- Industry:
- Health & Medicine (1.00)
- Media > News (1.00)
- Technology: