AITopics | multi-document summarization

Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages

Dahan, Noam, Kidron, Omer, Stanovsky, Gabriel

arXiv.org Artificial Intelligence, Nov-19-2025

High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process, suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2511.14598

Country:

Europe (1.00)
Asia > Middle East > Israel (0.93)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Industry:

Media > News (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)

Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization

Li, Chuyuan, Xu, Austin, Joty, Shafiq, Carenini, Giuseppe

arXiv.org Artificial Intelligence, Sep-15-2025

A key challenge in Multi-Document Summarization (MDS) is effectively integrating information from multiple sources while maintaining coherence and topical relevance. While Large Language Models have shown impressive results in single-document summarization, their performance on MDS still leaves room for improvement. In this paper, we propose a topic-guided reinforcement learning approach to improve content selection in MDS. We first show that explicitly prompting models with topic labels enhances the informativeness of the generated summaries. Building on this insight, we propose a novel topic reward within the Group Relative Policy Optimization (GRPO) framework to measure topic alignment between the generated summary and source documents. Experimental results on the Multi-News and Multi-XScience datasets demonstrate that our method consistently outperforms strong baselines, highlighting the effectiveness of leveraging topical cues in MDS.
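The idea of scoring sampled summaries with a topic reward and normalizing within a group can be sketched roughly as follows. This is a minimal illustration, not the paper's method: `topic_reward` is a hypothetical Jaccard overlap over topic labels, and `group_relative_advantages` uses one common GRPO-style normalization (reward minus group mean, divided by group standard deviation); the choice of population standard deviation and the `eps` term are assumptions.

```python
from statistics import mean, pstdev


def topic_reward(summary_topics, source_topics):
    """Toy topic-alignment reward (an assumption): Jaccard overlap of topic labels."""
    s, d = set(summary_topics), set(source_topics)
    union = s | d
    if not union:
        return 0.0
    return len(s & d) / len(union)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical topic labels for the source documents and three sampled summaries.
source_topics = ["election", "economy", "health"]
candidates = [
    ["election", "economy", "health"],  # covers every source topic
    ["election"],                       # covers one source topic
    ["sports"],                         # off-topic
]

rewards = [topic_reward(c, source_topics) for c in candidates]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f} advantage={a:+.3f}")
```

In this toy setup the summary that covers all source topics receives the largest advantage and the off-topic one the smallest, which is the signal a policy-gradient update would then amplify.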

computational linguistic, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2509.09852

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization

Zhang, Yongbing, Nan, Fang, Gao, Shengxiang, Huang, Yuxin, Tan, Kaiwen, Yu, Zhengtao

arXiv.org Artificial Intelligence, Aug-1-2025

The core challenge faced by multi-document summarization is the complexity of relationships among documents and the presence of information redundancy. Graph clustering is an effective paradigm for addressing this issue, as it models the complex relationships among documents using graph structures and reduces information redundancy through clustering, achieving significant research progress. However, existing methods often only consider single-relational graphs and require a predefined number of clusters, which hinders their ability to fully represent rich relational information and adaptively partition sentence groups to reduce redundancy. To overcome these limitations, we propose MRGSEM-Sum, an unsupervised multi-document summarization framework based on multi-relational graphs and structural entropy minimization. Specifically, we construct a multi-relational graph that integrates semantic and discourse relations between sentences, comprehensively modeling the intricate and dynamic connections among sentences across documents. We then apply a two-dimensional structural entropy minimization algorithm for clustering, automatically determining the optimal number of clusters and effectively organizing sentences into coherent groups. Finally, we introduce a position-aware compression mechanism to distill each cluster, generating concise and informative summaries. Extensive experiments on four benchmark datasets (Multi-News, DUC-2004, PubMed, and WikiSum) demonstrate that our approach consistently outperforms previous unsupervised methods and, in several cases, achieves performance comparable to supervised models and large language models. Human evaluation demonstrates that the summaries generated by MRGSEM-Sum exhibit high consistency and coverage, approaching human-level quality.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2507.234

Country: Asia > China (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

A Unified Retrieval Framework with Document Ranking and EDU Filtering for Multi-document Summarization

Tan, Shiyin, Park, Jaeeon, Li, Dongyuan, Jiang, Renhe, Okumura, Manabu

arXiv.org Artificial Intelligence, Apr-24-2025

In the field of multi-document summarization (MDS), transformer-based models have demonstrated remarkable success, yet they suffer from an input-length limitation. Current methods apply truncation after the retrieval process to fit the context length; however, they depend heavily on carefully hand-crafted queries, which are impractical to create for each document set in MDS. Additionally, these methods retrieve information at a coarse granularity, leading to the inclusion of irrelevant content. To address these issues, we propose a novel retrieval-based framework that integrates query selection, document ranking, and document shortening into a unified process. Our approach identifies the most salient elementary discourse units (EDUs) from input documents and utilizes them as latent queries. These queries guide the document ranking by calculating relevance scores. Instead of traditional truncation, our approach filters out irrelevant EDUs to fit the context length, ensuring that only critical information is preserved for summarization. We evaluate our framework on multiple MDS datasets, demonstrating consistent improvements in ROUGE metrics while confirming its scalability and flexibility across diverse model architectures. Additionally, we validate its effectiveness through an in-depth analysis, emphasizing its ability to dynamically select appropriate queries and accurately rank documents based on their relevance scores. These results demonstrate that our framework effectively addresses context-length constraints, establishing it as a robust and reliable solution for MDS.
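Filtering EDUs to fit a context budget, rather than truncating, can be sketched as below. This is only an illustration under stated assumptions: the queries are supplied by hand rather than selected from salient EDUs, `relevance` is a hypothetical token-overlap score, and the budget is counted in whitespace tokens; none of these names or choices come from the paper.

```python
def tokens(text):
    """Whitespace tokenization; a stand-in for a real tokenizer."""
    return text.lower().split()


def relevance(edu, queries):
    """Toy relevance (an assumption): best Jaccard token overlap with any query."""
    edu_set = set(tokens(edu))
    best = 0.0
    for query in queries:
        query_set = set(tokens(query))
        union = edu_set | query_set
        if union:
            best = max(best, len(edu_set & query_set) / len(union))
    return best


def filter_edus(edus, queries, budget):
    """Greedily keep the most relevant EDUs that fit the token budget, in source order."""
    ranked = sorted(
        range(len(edus)),
        key=lambda i: relevance(edus[i], queries),
        reverse=True,
    )
    kept, used = set(), 0
    for i in ranked:
        cost = len(tokens(edus[i]))
        if used + cost <= budget:
            kept.add(i)
            used += cost
    return [edus[i] for i in sorted(kept)]


# Hypothetical EDUs and a single hand-written latent query.
edus = [
    "the storm hit the coast on monday",
    "officials ordered evacuations",
    "a local bakery won an award",
    "the storm flooded coastal roads",
]
queries = ["storm hit the coast"]

selected = filter_edus(edus, queries, budget=12)
print(selected)
```

Restoring source order after selection keeps the retained units readable as a shortened document, which plain head-truncation cannot guarantee for content located late in the input.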

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2504.16711

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.15)

Genre: Research Report > New Finding (0.66)

Industry: Law (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)

Survey on Abstractive Text Summarization: Dataset, Models, and Metrics

Nnadi, Gospel Ozioma, Bertini, Flavio

arXiv.org Artificial Intelligence, Dec-22-2024

Readers and scholars often desire a concise summary (Too Long; Didn't Read - TL;DR) of texts to effectively prioritize information. However, creating document summaries is mentally taxing and time-consuming, especially considering the overwhelming volume of documents produced annually, as depicted in Figure 1 by [2] and in Figure 2. For example, [3] reported over 100,000 scientific articles on the coronavirus pandemic in 2020; although these articles include brief abstracts, the sheer volume poses challenges for researchers and medical professionals in quickly extracting relevant knowledge on a specific topic. An automatically generated multi-document summary could be valuable, providing readers with essential information and reducing the need to access original files unless refinement is necessary. Text summarization has garnered significant research attention, proving useful in search engines, news clustering, timeline generation, and various other applications. The objective of text summarization is to create a brief, coherent, factually consistent, and readable document that retains the essential information from the source, whether that source is a single document or multiple documents. In Single Document Summarization (SDS) only one input document is used, eliminating the need for additional processing to assess relationships between inputs. This method is suitable for summarizing standalone documents such as emails, legal contracts, and financial reports. The primary goal of Multi Document Summarization (MDS) is to gather information from several texts addressing the same topic, often composed at different times or representing diverse perspectives. The overarching objective is to produce information reports that are both succinct and comprehensive, consolidating varied opinions from documents that explore a topic through multiple viewpoints.

evolutionary algorithm, information retrieval, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2412.17165

Country:

Europe (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Law (1.00)
Media > News (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
(2 more...)

A Mixed-Language Multi-Document News Summarization Dataset and a Graphs-Based Extract-Generate Model

Gao, Shengxiang, Nan, Fang, Zhang, Yongbing, Huang, Yuxin, Tan, Kaiwen, Yu, Zhengtao

arXiv.org Artificial Intelligence, Oct-13-2024

Existing research on news summarization primarily focuses on single-language single-document (SLSD), single-language multi-document (SLMD) or cross-language single-document (CLSD). However, in real-world scenarios, news about an international event often involves multiple documents in different languages, i.e., mixed-language multi-document (MLMD). Therefore, summarizing MLMD news is of great significance. However, the lack of datasets for MLMD news summarization has constrained the development of research in this area. To fill this gap, we construct a mixed-language multi-document news summarization dataset (MLMD-news), which contains four different …

Figure 1: The diagram of SLSD, SLMD, CLSD and MLMD. Each rounded rectangle represents a source document, while the pointed rectangle represents the target summary. "En", "De", "Fr" and "Es" indicate that the text is in English, German, French, and Spanish, respectively.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2410.09773

Country:

Asia > China > Yunnan Province > Kunming (0.05)
Europe > France > Île-de-France > Paris > Paris (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Leveraging Long-Context Large Language Models for Multi-Document Understanding and Summarization in Enterprise Applications

Godbole, Aditi, George, Jabin Geevarghese, Shandilya, Smita

arXiv.org Artificial Intelligence, Sep-27-2024

The rapid increase in unstructured data across various fields has made multi-document comprehension and summarization a critical task. Traditional approaches often fail to capture relevant context, maintain logical consistency, and extract essential information from lengthy documents. This paper explores the use of Long-context Large Language Models (LLMs) for multi-document summarization, demonstrating their exceptional capacity to grasp extensive connections, provide cohesive summaries, and adapt to various industry domains and integration with enterprise applications/systems. The paper discusses the workflow of multi-document summarization for effectively deploying long-context LLMs, supported by case studies in legal applications, enterprise functions such as HR, finance, and sourcing, as well as in the medical and news domains. These case studies show notable enhancements in both efficiency and accuracy. Technical obstacles, such as dataset diversity, model scalability, and ethical considerations like bias mitigation and factual accuracy, are carefully analyzed. Prospective research avenues are suggested to augment the functionalities and applications of long-context LLMs, establishing them as pivotal tools for transforming information processing across diverse sectors and enterprise applications.

llm, proceedings, summarization, (10 more...)

arXiv.org Artificial Intelligence

2409.18454

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > Dominican Republic (0.04)
(16 more...)

Genre:

Research Report > Promising Solution (0.46)
Research Report > Experimental Study (0.34)

Industry:

Law (1.00)
Health & Medicine (1.00)
Media > News (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

GLIMMER: Incorporating Graph and Lexical Features in Unsupervised Multi-Document Summarization

Liu, Ran, Liu, Ming, Yu, Min, Jiang, Jianguo, Li, Gang, Zhang, Dan, Li, Jingyuan, Meng, Xiang, Huang, Weiqing

arXiv.org Artificial Intelligence, Aug-19-2024

Pre-trained language models are increasingly being used in multi-document summarization tasks. However, these models need large-scale corpora for pre-training and are domain-dependent. Other non-neural unsupervised summarization approaches mostly rely on key sentence extraction, which can lead to information loss. To address these challenges, we propose a lightweight yet effective unsupervised approach called GLIMMER: a Graph and LexIcal features based unsupervised Multi-docuMEnt summaRization approach. It first constructs a sentence graph from the source documents, then automatically identifies semantic clusters by mining low-level features from raw texts, thereby improving intra-cluster correlation and the fluency of generated sentences. Finally, it summarizes clusters into natural sentences. Experiments conducted on Multi-News, Multi-XScience and DUC-2004 demonstrate that our approach outperforms existing unsupervised approaches. Furthermore, it surpasses state-of-the-art pre-trained multi-document summarization models (e.g. PEGASUS and PRIMERA) under zero-shot settings in terms of ROUGE scores. Additionally, human evaluations indicate that summaries generated by GLIMMER achieve high readability and informativeness scores. Our code is available at https://github.com/Oswald1997/GLIMMER.
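The first two steps described above, building a sentence graph and grouping sentences into clusters, can be sketched in a simplified form. This is not GLIMMER's algorithm (see the linked repository for that): the edge rule is a hypothetical Jaccard token-overlap threshold, and clusters are taken to be plain connected components.

```python
def jaccard(a, b):
    """Lexical overlap between two sentences on lowercase whitespace tokens."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def build_sentence_graph(sentences, threshold):
    """Connect two sentences when their overlap reaches the (assumed) threshold."""
    adjacency = {i: set() for i in range(len(sentences))}
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if jaccard(sentences[i], sentences[j]) >= threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


def connected_components(adjacency):
    """Group node indices into clusters with an iterative depth-first search."""
    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


# Hypothetical sentences drawn from two unrelated news stories.
sentences = [
    "the central bank raised interest rates",
    "interest rates were raised by the central bank",
    "the team won the final match",
    "fans celebrated after the team won the match",
]

graph = build_sentence_graph(sentences, threshold=0.4)
clusters = connected_components(graph)
print(clusters)
```

Each resulting cluster would then be compressed into a single output sentence; that generation step is outside the scope of this sketch.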

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2408.10115

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Summarization of Investment Reports Using Pre-trained Model

Sakaji, Hiroki, Kobayashi, Ryotaro, Izumi, Kiyoshi, Mitsugi, Hiroyuki, Kuramoto, Wataru

arXiv.org Artificial Intelligence, Aug-3-2024

In this paper, we attempt to summarize monthly reports as investment reports. Fund managers have a wide range of tasks, one of which is the preparation of investment reports. In addition to preparing monthly reports on fund management, fund managers prepare management reports that summarize these monthly reports every six months or once a year. The preparation of fund reports is a labor-intensive and time-consuming task. Therefore, in this paper, we tackle investment summarization from monthly reports using transformer-based models. There are two main types of summarization methods: extractive summarization and abstractive summarization, and this study constructs both methods and examines which is more useful in summarizing investment reports.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/IIAI-AAI59060.2023.00111

2408.01744

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.17)
Europe > Italy > Tuscany > Florence (0.05)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(10 more...)

Genre: Research Report (0.82)

Industry:

Banking & Finance > Trading (1.00)
Banking & Finance > Economy (1.00)
Government > Regional Government > North America Government > United States Government (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

M2DS: Multilingual Dataset for Multi-document Summarisation

Hewapathirana, Kushan, de Silva, Nisansa, Athuraliya, C. D.

arXiv.org Artificial Intelligence, Jul-17-2024

In the rapidly evolving digital era, there is an increasing demand for concise information as individuals seek to distil key insights from various sources. Recent attention from researchers on Multi-document Summarisation (MDS) has resulted in diverse datasets covering customer reviews, academic papers, medical and legal documents, and news articles. However, the English-centric nature of these datasets has created a conspicuous void for multilingual datasets in today's globalised digital landscape, where linguistic diversity is celebrated. Media platforms such as the British Broadcasting Corporation (BBC) have disseminated news in 20+ languages for decades. With only 380 million people speaking English as their first language, accounting for less than 5% of the global population, the vast majority primarily relies on other languages. These facts underscore the need for inclusivity in MDS research, utilising resources from diverse languages. Recognising this gap, we present the Multilingual Dataset for Multi-document Summarisation (M2DS), which, to the best of our knowledge, is the first dataset of its kind. It includes document-summary pairs in five languages from BBC articles published during the 2010-2023 period. This paper introduces M2DS, emphasising its unique multilingual aspect, and includes baseline scores from state-of-the-art MDS models evaluated on our dataset.

dataset, multilingual dataset, summarization, (14 more...)

arXiv.org Artificial Intelligence

2407.12336

Country: Asia > Sri Lanka (0.05)

Genre: Research Report > New Finding (0.46)

Industry: Media (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
