A Mixed-Language Multi-Document News Summarization Dataset and a Graphs-Based Extract-Generate Model

Gao, Shengxiang, Nan, Fang, Zhang, Yongbing, Huang, Yuxin, Tan, Kaiwen, Yu, Zhengtao

arXiv.org Artificial Intelligence 

Figure 1: The diagram of SLSD, SLMD, CLSD and MLMD. Each rounded rectangle represents a source document, while the pointed rectangle represents the target summary. "En", "De", "Fr" and "Es" indicate that the text is in English, German, French, and Spanish, respectively.

Existing research on news summarization primarily focuses on single-language single-document (SLSD), single-language multi-document (SLMD), or cross-language single-document (CLSD) settings. However, in real-world scenarios, news about an international event often involves multiple documents in different languages, i.e., mixed-language multi-document (MLMD). Summarizing MLMD news is therefore of great significance, but the lack of datasets for MLMD news summarization has constrained the development of research in this area. To fill this gap, we construct a mixed-language multi-document news summarization dataset (MLMD-news), which contains four different