News Summarization and Evaluation in the Era of GPT-3
Tanya Goyal, Junyi Jessy Li, Greg Durrett
The recent success of prompting large language models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that neither reference-based nor reference-free automatic metrics can reliably evaluate GPT-3 summaries. Finally, we evaluate models on a setting beyond generic summarization, specifically keyword-based summarization, and show how dominant fine-tuning approaches compare to prompting. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and prompt-based models across 4 standard summarization benchmarks, and (b) 1K human preference judgments comparing different systems for generic- and keyword-based summarization.
arXiv.org Artificial Intelligence
May 23, 2023
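The abstract describes prompting GPT-3 with only a task description, for both generic and keyword-based summarization. A minimal sketch of that setup follows, assuming the legacy OpenAI completions API (openai<1.0); the model name text-davinci-002, the three-sentence length constraint, and the prompt wording are illustrative assumptions, not the paper's exact templates.

```python
# Sketch of task-description-only prompting for news summarization.
# Assumes the legacy OpenAI completions API (openai<1.0); the model name
# and prompt wording are illustrative, not the paper's exact templates.
from typing import Optional

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder credential


def gpt3_summarize(article: str, keyword: Optional[str] = None) -> str:
    """Zero-shot summarization via a task-description-only prompt.

    Generic summarization uses only the task description; keyword-based
    summarization adds the keyword to that instruction.
    """
    if keyword is None:
        instruction = "Summarize the above article in three sentences."
    else:
        instruction = (
            f'Summarize the above article, focusing on "{keyword}", '
            "in three sentences."
        )
    prompt = f"Article: {article}\n\n{instruction}\n\nSummary:"
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=256,
        temperature=0.0,  # deterministic decoding for easier comparison
    )
    return response["choices"][0]["text"].strip()
```

The keyword variant simply folds the keyword into the instruction rather than changing the model or decoding setup, which mirrors the paper's framing of keyword-based summarization as a prompting-time change.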