RedPajama: an Open Dataset for Training Large Language Models

May-27-2025, 17:56:31 GMT–Neural Information Processing Systems

Large language models are increasingly becoming a cornerstone technology in artificial intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset composition and filtering remain largely elusive. Many of the top-performing models lack transparency in their dataset curation and model development processes, posing an obstacle to the development of fully open language models. In this paper, we identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis. To address these challenges, we release RedPajama-V1, an open reproduction of the LLaMA training dataset.

dataset, language model, redpajama

Country:
- North America > United States > Virginia (0.07)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.63)