TeGit: Generating High-Quality Instruction-Tuning Data with Text-Grounded Task Design
Yongrui Chen, Haiyun Jiang, Xinting Huang, Shuming Shi, Guilin Qi
–arXiv.org Artificial Intelligence
High-quality instruction-tuning data is critical to improving LLM capabilities. Existing data collection methods are limited either by the unrealistic cost of manual labeling or by the hallucination that comes from relying solely on LLM generation. To address these problems, this paper presents a scalable method for automatically collecting high-quality instruction-tuning data by training language models to design tasks grounded in human-written texts. Intuitively, grounding generation in human-written text helps the model reduce hallucination during task design. Unlike instruction back-translation-based methods, which directly take the given text as a response, we require the model to generate the instruction, input, and output simultaneously, which helps filter out noise. Results from both automated and manual evaluation experiments demonstrate the quality of our dataset.
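The noise-filtering idea in the abstract can be illustrated with a minimal sketch: parse a model response into an (instruction, input, output) triple and keep it only if the output overlaps sufficiently with the grounding text. The JSON serialization, the `min_overlap` threshold, and the token-overlap criterion are all illustrative assumptions; the paper's actual filtering is not specified here.

```python
import json

def _tokens(text: str) -> set[str]:
    # Lowercase, punctuation-stripped token set (simplistic by design).
    return {t.strip(".,;:!?") for t in text.lower().split()} - {""}

def parse_task(model_response: str):
    """Parse a model response assumed to be a JSON object with
    'instruction', 'input', and 'output' fields (format is an assumption;
    the paper does not specify its serialization)."""
    try:
        task = json.loads(model_response)
    except json.JSONDecodeError:
        return None
    if not all(k in task for k in ("instruction", "input", "output")):
        return None  # incomplete triples are treated as noise
    return task

def grounded(task: dict, source_text: str, min_overlap: float = 0.3) -> bool:
    """Crude grounding check: keep a task only if a minimum fraction of the
    output's tokens appears in the source text. This is a stand-in heuristic,
    not the paper's actual criterion."""
    out = _tokens(task["output"])
    if not out:
        return False
    return len(out & _tokens(source_text)) / len(out) >= min_overlap

source = ("The Nile is the longest river in Africa, "
          "flowing north into the Mediterranean.")
response = json.dumps({
    "instruction": "Answer the question using the passage.",
    "input": "Which continent is the Nile in?",
    "output": "The Nile is in Africa.",
})
task = parse_task(response)
print(task is not None and grounded(task, source))  # True for this example
```

In practice a learned or model-based judge would replace the token-overlap heuristic, but the pipeline shape (generate full triple, then filter against the source text) is the same.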
Sep-11-2023