Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

Yue Yu

May-25-2025, 08:47:17 GMT – Neural Information Processing Systems

Large language models (LLMs) have recently been leveraged as training data generators for various natural language processing (NLP) tasks. While previous research has explored different approaches to training models on generated data, these approaches generally rely on simple class-conditional prompts, which may limit the diversity of the generated data and inherit the systematic biases of the LLM. We therefore investigate training data generation with diversely attributed prompts (e.g., prompts that specify attributes such as length and style), which have the potential to yield diverse and attributed generated data. Our investigation focuses on datasets with high cardinality and diverse domains, on which we demonstrate that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance.
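To make the contrast between the two prompt types concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the attribute values, and the function names `class_conditional_prompt` and `attributed_prompts` are all hypothetical. The abstract names only length and style as example attributes.

```python
import itertools

# Hypothetical attribute values. The abstract mentions only "length" and
# "style" as example attributes; these specific values are assumptions.
ATTRIBUTES = {
    "length": ["under 50 words", "100 to 150 words"],
    "style": ["formal", "conversational"],
}


def class_conditional_prompt(label: str) -> str:
    """Simple prompt that conditions only on the class label."""
    return f"Write a news article about {label}."


def attributed_prompts(label: str) -> list[str]:
    """One prompt per combination of attribute values for the given label."""
    names = list(ATTRIBUTES)
    prompts = []
    # Cartesian product over all attribute value lists.
    for values in itertools.product(*(ATTRIBUTES[name] for name in names)):
        spec = "; ".join(f"{name}: {value}" for name, value in zip(names, values))
        prompts.append(f"{class_conditional_prompt(label)} Attributes: {spec}.")
    return prompts


if __name__ == "__main__":
    print(class_conditional_prompt("sports"))
    for prompt in attributed_prompts("sports"):
        print(prompt)
```

With two values for each of two attributes, a single class label yields four distinct prompts instead of one, which is the mechanism by which attributed prompts could increase the diversity of the generated data.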

large language model, machine learning, natural language

Country:
- Asia (0.67)
- North America > United States (1.00)

Genre:
- Personal (0.67)
- Research Report > New Finding (0.92)

Industry:
- Consumer Products & Services > Restaurants (0.68)
- Media
  - Film (1.00)
  - News (0.94)
  - Television (0.67)
- Education (1.00)
- Banking & Finance > Economy (1.00)
- Government
  - Military (0.67)
  - Regional Government > North America Government
    - United States Government (1.00)
- Health & Medicine
  - Consumer Health (1.00)
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area (1.00)
- Law (1.00)
- Information Technology
  - Security & Privacy (1.00)
  - Services (0.67)
- Energy > Renewable (0.67)
- Leisure & Entertainment
  - Games > Computer Games (1.00)
  - Sports > Baseball (0.67)
- Social Sector (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.96)
  - Natural Language > Large Language Model (1.00)