Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models

Ruida Wang, Wangchunshu Zhou, Mrinmaya Sachan

arXiv.org Artificial Intelligence 

Data synthesis is a promising way to train a small model with very little labeled data. One approach to data synthesis is to leverage the rich knowledge of large language models to synthesize pseudo training examples for small models, making it possible to achieve both data and compute efficiency at the same time. However, a key challenge in data synthesis is that the synthesized dataset often suffers from a large distributional discrepancy from the real task data distribution. Thus, in this paper, we propose Synthesis Step by Step (S3), a data synthesis framework that shrinks this distribution gap by iteratively extrapolating the errors made by a small model, trained on the synthesized dataset, on a small real-world validation dataset using a large language model. Extensive experiments on multiple NLP tasks show that our approach improves the performance of a small model by reducing the gap between the synthetic data and the real data distribution.

[Figure 1: Training and testing accuracy of DistilBERT with ZeroGen (Ye et al., 2022b) on the IMDb dataset with 200k training datapoints, alongside the training and testing accuracy of the model trained on Gold Data. ZeroGen's training accuracy quickly reaches nearly 100%, but its testing accuracy remains low.]
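A minimal sketch of the iterative loop described in the abstract, not the authors' implementation: an LLM synthesizes a seed dataset, a small model is trained on it, the model's errors on a small real validation set are collected, and the LLM is then asked to synthesize additional examples that extrapolate those errors. All function hooks (synthesize_seed, synthesize_from_errors, train_small_model) are hypothetical placeholders introduced here for illustration.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label)


def synthesis_step_by_step(
    synthesize_seed: Callable[[int], List[Example]],
    synthesize_from_errors: Callable[[List[Example]], List[Example]],
    train_small_model: Callable[[List[Example]], Callable[[str], int]],
    validation_set: List[Example],
    seed_size: int = 1000,
    num_rounds: int = 3,
) -> Callable[[str], int]:
    """Iteratively grow a synthetic dataset by extrapolating validation errors."""
    # Round 0: the LLM synthesizes an initial training set from the task description.
    dataset = synthesize_seed(seed_size)
    model = train_small_model(dataset)

    for _ in range(num_rounds):
        # Collect the real validation examples the current small model gets wrong.
        errors = [(x, y) for x, y in validation_set if model(x) != y]
        if not errors:
            break
        # Ask the LLM for new examples resembling the misclassified ones,
        # shrinking the gap between the synthetic and real data distributions.
        dataset += synthesize_from_errors(errors)
        model = train_small_model(dataset)

    return model
```

In this sketch the validation set is only used to surface error patterns for the LLM; the small model is always trained on synthetic data, which is the data-efficiency point the abstract emphasizes.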
