ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction

Park, Jeiyoon, Park, Chanjun, Lim, Heuiseok

Jun-11-2024–arXiv.org Artificial Intelligence

We explore and improve the ability of LLMs to generate data for grammatical error correction (GEC). When LLMs are simply asked to produce parallel sentences, the resulting patterns are too simplistic to be valuable as a corpus. To address this issue, we propose an automated framework that includes a Subject Selector, a Grammar Selector, a Prompt Manager, and an Evaluator. We also introduce ChatLang-8, a new dataset for GEC tasks that covers eight types of subject nouns and 23 types of grammar and consists of 1 million pairs featuring human-like grammatical errors. Our experiments show that ChatLang-8 exhibits a more uniform pattern composition than existing GEC datasets, and that model performance improves when ChatLang-8 is used instead of existing GEC datasets. These results suggest that our framework and ChatLang-8 are valuable resources for enhancing ChatGPT's data generation capabilities.
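The four components named in the abstract could be wired together roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract gives the counts of subject and grammar types but not their names, so the category lists are placeholders, and the prompt wording, the `|||` output format, the filtering rule in `evaluate`, and every function name here are hypothetical. The LLM is passed in as a plain callable so that no particular API is assumed.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

# Placeholder categories (assumption): the abstract states there are eight
# subject-noun types and 23 grammar types but does not list them.
SUBJECT_TYPES = ["subject_type_1", "subject_type_2"]
GRAMMAR_TYPES = ["grammar_type_1", "grammar_type_2", "grammar_type_3"]


@dataclass(frozen=True)
class GecPair:
    source: str   # sentence containing a grammatical error
    target: str   # corrected sentence
    subject: str  # subject type chosen for this pair
    grammar: str  # grammar type chosen for this pair


def select_subject(rng: random.Random) -> str:
    """Hypothetical Subject Selector: sample one subject type uniformly."""
    return rng.choice(SUBJECT_TYPES)


def select_grammar(rng: random.Random) -> str:
    """Hypothetical Grammar Selector: sample one grammar type uniformly."""
    return rng.choice(GRAMMAR_TYPES)


def build_prompt(subject: str, grammar: str) -> str:
    """Hypothetical Prompt Manager: combine both selections into one request."""
    return (
        f"Write one sentence whose subject is of type '{subject}' and that "
        f"contains a '{grammar}' error, then the corrected sentence. "
        "Answer as: <erroneous> ||| <corrected>"
    )


def evaluate(raw: str, subject: str, grammar: str) -> Optional[GecPair]:
    """Hypothetical Evaluator: keep only well-formed, non-identical pairs."""
    parts = [p.strip() for p in raw.split("|||")]
    if len(parts) != 2:
        return None
    source, target = parts
    if not source or not target or source == target:
        return None
    return GecPair(source, target, subject, grammar)


def generate_pairs(
    llm: Callable[[str], str],
    n: int,
    seed: int = 0,
    max_attempts: Optional[int] = None,
) -> List[GecPair]:
    """Run select -> prompt -> generate -> evaluate until n pairs are kept."""
    rng = random.Random(seed)
    limit = max_attempts if max_attempts is not None else 10 * n
    pairs: List[GecPair] = []
    attempts = 0
    while len(pairs) < n and attempts < limit:
        attempts += 1
        subject = select_subject(rng)
        grammar = select_grammar(rng)
        pair = evaluate(llm(build_prompt(subject, grammar)), subject, grammar)
        if pair is not None:
            pairs.append(pair)
    return pairs
```

Sampling subject and grammar types independently and uniformly is one simple way to push toward the uniform pattern composition the abstract reports; whether the framework actually samples this way is not stated in the abstract.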

chatlang-8, computational linguistic, dataset, (10 more...)

Genre:
- Research Report > New Finding (0.48)

Industry:
- Media (0.68)
- Leisure & Entertainment (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Grammars & Parsing (0.95)
  - Machine Learning > Neural Networks
    - Deep Learning (0.51)
