RedPajama: an Open Dataset for Training Large Language Models

Maurice Weber

Neural Information Processing Systems 

In this paper, we identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis.
