The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only
The Falcon LLM Team
–Neural Information Processing Systems
This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is, and whether we will soon run out of unique high-quality data.
Apr-30-2026, 09:16:27 GMT