The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only
The Falcon LLM Team

Neural Information Processing Systems 

This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is, and whether we will soon run out of unique high-quality data.
