Revisiting Cross-Lingual Summarization: A Corpus-based Study and A New Benchmark with Improved Annotation

Chen, Yulong, Zhang, Huajian, Zhou, Yijie, Bai, Xuefeng, Wang, Yueguan, Zhong, Ming, Yan, Jianhao, Li, Yafu, Li, Judy, Zhu, Michael, Zhang, Yue

Jul-8-2023–arXiv.org Artificial Intelligence

Most existing cross-lingual summarization (CLS) work constructs CLS corpora by directly translating pre-annotated summaries from one language to another, which can introduce errors from both the summarization and translation processes. To address this issue, we propose ConvSumX, a cross-lingual conversation summarization benchmark, built with a new annotation schema that explicitly considers the source input context. ConvSumX consists of 2 sub-tasks under different real-world scenarios, each covering 3 language directions. We conduct a thorough analysis of ConvSumX and 3 widely used, manually annotated CLS corpora and empirically find that ConvSumX is more faithful to the input text. Additionally, based on the same intuition, we propose a 2-Step method, which takes both the conversation and the summary as input to simulate the human annotation process. Experimental results show that the 2-Step method surpasses strong baselines on ConvSumX under both automatic and human evaluation. Analysis shows that both the source input text and the summary are crucial for modeling cross-lingual summaries.

machine learning, natural language, translation

Country:
- North America > United States
  - New York (0.04)
  - Indiana (0.04)
  - Massachusetts (0.04)
  - Minnesota > Hennepin County
    - Minneapolis (0.14)
  - Maine > Kennebec County
    - Waterville (0.04)
- Europe
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
- Asia
  - China > Hong Kong (0.04)
  - Middle East
    - Jordan (0.05)
    - UAE > Abu Dhabi Emirate
      - Abu Dhabi (0.04)
- Africa > Ethiopia
  - Addis Ababa > Addis Ababa (0.04)

Genre:
- Research Report > New Finding (0.88)

Industry:
- Leisure & Entertainment (1.00)
- Media > Television (0.68)
- Government > Regional Government
  - North America Government > United States Government (0.68)

Technology:
- Information Technology
  - Communications (0.67)
  - Artificial Intelligence
    - Machine Learning (1.00)
    - Natural Language > Text Processing (0.46)
