The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only

The Falcon LLM Team

Neural Information Processing Systems 

Zero-shot performance on our main-agg task aggregate (see Section 4.1 for details). At equivalent compute budgets (in PetaFLOPS-days), our models significantly outperform publicly available models trained on The Pile, and match the performance of the GPT-3 models.