The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
–Neural Information Processing Systems
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings.
Nov-16-2025, 04:52:52 GMT
- Country:
  - Africa > Niger (0.04)
  - Asia
    - Afghanistan > Parwan Province
      - Charikar (0.04)
    - China > Beijing
      - Beijing (0.04)
    - Indonesia (0.04)
    - Japan
      - Honshū
        - Chūbu > Toyama Prefecture
          - Toyama (0.04)
        - Kantō > Tokyo Metropolis Prefecture
          - Tokyo (0.04)
      - Kyūshū & Okinawa > Kyūshū
        - Miyazaki Prefecture > Miyazaki (0.04)
    - Middle East > Republic of Türkiye
      - Istanbul Province > Istanbul (0.04)
    - Singapore (0.04)
    - Vietnam (0.04)
  - Europe
    - Ireland (0.04)
    - United Kingdom > Scotland
      - City of Edinburgh > Edinburgh (0.04)
    - Middle East > Republic of Türkiye
      - Istanbul Province > Istanbul (0.04)
    - Italy
      - Calabria > Catanzaro Province
        - Catanzaro (0.04)
      - Liguria > Genoa (0.04)
      - Trentino-Alto Adige/Südtirol > Trentino Province
        - Trento (0.04)
      - Tuscany > Florence (0.04)
    - Slovenia (0.04)
    - Norway (0.04)
    - Bulgaria > Sofia City Province
      - Sofia (0.04)
    - Denmark > Capital Region
      - Copenhagen (0.04)
    - Spain
      - Basque Country (0.04)
      - Catalonia > Barcelona Province
        - Barcelona (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Germany
    - Iceland > Capital Region
      - Reykjavik (0.04)
  - North America
    - Canada > Quebec
      - Montreal (0.04)
    - Dominican Republic (0.04)
    - United States > Michigan
      - Washtenaw County > Ann Arbor (0.04)
  - Oceania > Australia
    - Victoria > Melbourne (0.04)
    - Western Australia (0.04)
- Industry:
  - Health & Medicine > Therapeutic Area (0.67)
- Technology:
  - Information Technology
    - Artificial Intelligence
      - Machine Learning > Neural Networks
        - Deep Learning (0.46)
      - Natural Language
        - Large Language Model (1.00)
        - Machine Translation (0.93)
        - Text Processing (1.00)
    - Communications > Social Media (1.00)
    - Data Science (1.00)
    - Information Management (1.00)