GenQA: Generating Millions of Instructions from a Handful of Prompts

Chen, Jiuhai, Qadri, Rifaa, Wen, Yuxin, Jain, Neel, Kirchenbauer, John, Zhou, Tianyi, Goldstein, Tom

Jun-14-2024–arXiv.org Artificial Intelligence

Most public instruction finetuning datasets are small compared to the closed-source datasets used to train industry models. Studying questions about finetuning at scale, such as curricula and learning rate cooldown schedules, requires industrial-scale datasets. At this scale, however, the data generation process must be almost entirely automated. In this work, we study methods for generating large instruction datasets from a single prompt. With little human oversight, we get LLMs to write diverse sets of instruction examples, ranging from simple completion tasks to complex multi-turn dialogs, across a variety of subject areas. When used to finetune a Llama-3 8B base model, our dataset meets or exceeds both WizardLM and Ultrachat on knowledge-intensive leaderboard tasks as well as conversational evaluations. We release our dataset, the "generator" prompts that created it, and our finetuned model checkpoints.

dataset, instruction, question and answer

Country:
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- North America > United States
  - Maryland (0.04)
  - New York
    - Richmond County > New York City (0.04)
    - Queens County > New York City (0.04)
    - New York County > New York City (0.04)
    - Kings County > New York City (0.04)
    - Bronx County > New York City (0.04)
  - California > Los Angeles County
    - Los Angeles (0.14)

Genre:
- Personal (0.92)
- Research Report (0.81)
- Instructional Material > Training Manual (0.46)

Industry:
- Media (1.00)
- Leisure & Entertainment (1.00)
- Health & Medicine (1.00)
- Banking & Finance (1.00)
- Information Technology > Security & Privacy (0.67)
- Law
  - Litigation (1.00)
  - Intellectual Property & Technology Law (0.67)
  - Environmental Law (0.67)
- Government > Regional Government
  - North America Government > United States Government (0.93)
- Education > Curriculum
  - Subject-Specific Education (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
