5W1H Extraction With Large Language Models

Cao, Yang, Lan, Yangsong, Zhai, Feiyan, Li, Piji

May-25-2024–arXiv.org Artificial Intelligence

The extraction of essential news elements through the 5W1H framework (What, When, Where, Why, Who, and How) is critical for event extraction and text summarization. The advent of large language models (LLMs) such as ChatGPT presents an opportunity to address language-related tasks through simple prompts, without time-consuming fine-tuning. However, ChatGPT has encountered challenges in processing longer news texts and analyzing specific attributes in context, especially when answering questions about What, Why, and How. The effectiveness of extraction tasks also depends notably on high-quality human-annotated datasets, and the absence of such datasets for 5W1H extraction makes fine-tuning open-source LLMs more difficult. To address these limitations, first, we annotate a high-quality 5W1H dataset based on four typical news corpora (CNN/DailyMail, XSum, NYT, RA-MDS); second, we design several strategies, from zero-shot/few-shot prompting to efficient fine-tuning, to extract 5W1H aspects from the original news documents. The experimental results demonstrate that models fine-tuned on our labelled dataset outperform ChatGPT. Furthermore, we explore domain adaptation capability by testing source-domain (e.g., NYT) models on a target-domain corpus (e.g., CNN/DailyMail) for the task of 5W1H extraction.
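As a rough illustration of the zero-shot/few-shot prompting strategies mentioned in the abstract, the sketch below assembles a 5W1H extraction prompt and parses a JSON-formatted reply. The prompt wording, the JSON answer format, and the helper names `build_prompt` and `parse_answer` are assumptions made for this example only; they are not taken from the paper, and no particular LLM API is assumed.

```python
import json

# The six aspects of the 5W1H framework, as listed in the abstract.
ASPECTS = ["What", "When", "Where", "Why", "Who", "How"]


def build_prompt(document, examples=()):
    """Build a zero-shot prompt (no examples) or a few-shot prompt.

    `examples` is a sequence of (article_text, answer_dict) pairs.
    The instruction wording is a hypothetical choice for this sketch.
    """
    lines = [
        "Extract the 5W1H elements from the news text.",
        "Answer with a JSON object whose keys are: " + ", ".join(ASPECTS) + ".",
        "",
    ]
    # Few-shot demonstrations, if any, come before the target article.
    for ex_doc, ex_answer in examples:
        lines += ["Article: " + ex_doc, "Answer: " + json.dumps(ex_answer), ""]
    lines += ["Article: " + document, "Answer:"]
    return "\n".join(lines)


def parse_answer(raw):
    """Parse a model reply into a dict holding all six aspects.

    Aspects missing from the reply, or replies that are not a JSON
    object, fall back to empty strings.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {aspect: str(data.get(aspect, "")) for aspect in ASPECTS}
```

Calling `build_prompt(article)` with no examples yields a zero-shot prompt, while passing a few annotated (article, answer) pairs yields a few-shot prompt. The resulting string would be sent to whichever model is being evaluated, and the reply passed to `parse_answer`.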

dataset, extraction, information, (12 more...)

Country:
- North America
  - United States > California (0.04)
  - Mexico > Baja California (0.04)
- Europe > United Kingdom
  - England
    - Tyne and Wear > Sunderland (0.05)
    - Greater London > London (0.04)
- Asia
  - Japan (0.05)
  - India > NCT
    - New Delhi (0.04)
    - Delhi (0.04)
  - China > Jiangsu Province
    - Nanjing (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Law (0.69)
- Government (0.46)
- Leisure & Entertainment > Sports
  - Soccer (0.69)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
