TeGit: Generating High-Quality Instruction-Tuning Data with Text-Grounded Task Design
Yongrui Chen, Haiyun Jiang, Xinting Huang, Shuming Shi, Guilin Qi
–arXiv.org Artificial Intelligence
High-quality instruction-tuning data is critical to improving LLM capabilities. Existing data collection methods are limited either by the unrealistic cost of manual labeling or by the hallucination that comes from relying solely on LLM generation. To address these problems, this paper presents a scalable method for automatically collecting high-quality instruction-tuning data by training language models to design tasks grounded in human-written texts. Intuitively, grounding generation in human-written text helps the model reduce hallucination during task design. Unlike instruction back-translation-based methods, which directly take the given text as a response, we require the model to generate the instruction, input, and output simultaneously, which helps filter out noise. Results from both automated and manual evaluation experiments demonstrate the quality of our dataset.
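The noise-filtering idea in the abstract can be illustrated with a minimal sketch: parse a model response into an (instruction, input, output) triple and keep it only if the output overlaps sufficiently with the grounding text. The JSON serialization, the `min_overlap` threshold, and the token-overlap criterion are all illustrative assumptions; the paper's actual filtering is not specified here.

```python
import json

def _tokens(text: str) -> set[str]:
    # Lowercase, punctuation-stripped token set (simplistic by design).
    return {t.strip(".,;:!?") for t in text.lower().split()} - {""}

def parse_task(model_response: str):
    """Parse a model response assumed to be a JSON object with
    'instruction', 'input', and 'output' fields (format is an assumption;
    the paper does not specify its serialization)."""
    try:
        task = json.loads(model_response)
    except json.JSONDecodeError:
        return None
    if not all(k in task for k in ("instruction", "input", "output")):
        return None  # incomplete triples are treated as noise
    return task

def grounded(task: dict, source_text: str, min_overlap: float = 0.3) -> bool:
    """Crude grounding check: keep a task only if a minimum fraction of the
    output's tokens appears in the source text. This is a stand-in heuristic,
    not the paper's actual criterion."""
    out = _tokens(task["output"])
    if not out:
        return False
    return len(out & _tokens(source_text)) / len(out) >= min_overlap

source = ("The Nile is the longest river in Africa, "
          "flowing north into the Mediterranean.")
response = json.dumps({
    "instruction": "Answer the question using the passage.",
    "input": "Which continent is the Nile in?",
    "output": "The Nile is in Africa.",
})
task = parse_task(response)
print(task is not None and grounded(task, source))  # True for this example
```

In practice a learned or model-based judge would replace the token-overlap heuristic, but the pipeline shape (generate full triple, then filter against the source text) is the same.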
Sep-11-2023