Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias Yue Yu
Large language models (LLMs) have recently been leveraged as training data generators for various natural language processing (NLP) tasks. While previous research has explored different approaches to training models on generated data, it generally relies on simple class-conditional prompts, which may limit the diversity of the generated data and inherit the systematic biases of the LLM. Thus, we investigate training data generation with diversely attributed prompts (e.g.,
1 Datasheet for QM1B
As recommended by the NeurIPS dataset and benchmark track, we documented QM1B and its intended uses through the Datasheets for Datasets framework [1]. The goal of dataset datasheets, as outlined by [1], is to provide a standardized process for documenting datasets. The authors of [1] present a list of carefully selected questions that dataset authors should answer. We hope our answers to these questions will facilitate better communication between us (the dataset creators) and future users of QM1B.
For what purpose was the dataset created?
Prior Gaussian-based Density Functional Theory (DFT) datasets contained fewer than 20 million training examples.