LR-Sum: Summarization for Less-Resourced Languages

Palen-Michel, Chester; Lignos, Constantine

Oct-26-2023–arXiv.org Artificial Intelligence

This preprint describes work in progress on LR-Sum, a new permissively licensed dataset created to enable further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced. We describe our process for extracting and filtering the dataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). The source data is public domain newswire collected from Voice of America websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0), making it one of the most openly licensed multilingual summarization datasets. We describe how we plan to use the data for modeling experiments and discuss limitations of the dataset.

Country:
- Oceania > Australia (0.04)
- North America
  - Dominican Republic (0.04)
  - United States
    - New York (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
- Europe
  - Germany > Berlin (0.04)
  - Ukraine > Sumy Oblast
    - Sumy (0.04)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
  - Czechia > South Moravian Region
    - Brno (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - Cambodia (0.04)
  - Vietnam > Hanoi
    - Hanoi (0.04)
  - Middle East
    - Israel (0.04)
    - Iraq (0.04)
  - Japan > Hokkaidō
    - Hokkaidō Prefecture > Sapporo (0.04)
  - China > Beijing
    - Beijing (0.04)
- Africa
  - Zimbabwe (0.04)
  - Zambia (0.04)
  - Namibia (0.04)
  - Botswana (0.04)
  - Angola (0.04)

Genre:
- Research Report (0.40)

Industry:
- Information Technology (0.66)
- Government > Regional Government
  - North America Government > United States Government (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Natural Language > Text Processing (0.66)
