The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Neural Information Processing Systems 

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings.
