Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages
Noam Dahan, Omer Kidron, Gabriel Stanovsky
arXiv.org Artificial Intelligence
High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process suited to varying levels of linguistic resources. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.
Nov-19-2025
- Country:
- Asia
- Indonesia > Bali (0.04)
- Middle East > Israel
- Central District > Ramla (0.04)
- Haifa District > Haifa (0.04)
- Jerusalem District > Jerusalem (0.04)
- Southern District > Eilat (0.04)
- Tel Aviv District > Tel Aviv (0.04)
- Europe
- North America > United States
- New Mexico > Bernalillo County > Albuquerque (0.04)
- Genre:
- Research Report
- New Finding (0.46)
- Promising Solution (0.34)
- Industry:
- Health & Medicine (1.00)
- Media > News (1.00)