FuLG: 150B Romanian Corpus for Language Model Pretraining
Vlad-Andrei Bădoiu, Mihai-Valentin Dumitru, Alexandru M. Gherghescu, Alexandru Agache, Costin Raiciu
arXiv.org Artificial Intelligence
Research in the field of language models is rapidly evolving, with many open models being released to the public. Openly available pretraining corpora usually focus on only a handful of languages, with many others either missing completely or severely underrepresented. In this report, we introduce FuLG, a 150-billion-token Romanian corpus extracted from CommonCrawl. We present our methodology for filtering FuLG and compare it via ablation studies against existing Romanian corpora.
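The abstract names a filtering methodology but does not detail it here. As a hedged illustration only, and not the paper's actual pipeline, a minimal language-identification pass over crawled documents might look like the following stopword-ratio sketch (the stopword list and threshold are assumptions for demonstration):

```python
# Hypothetical sketch: a crude Romanian-language filter for web-crawl text.
# This does NOT reproduce the FuLG pipeline; it only illustrates the general
# idea of language-based document filtering over a CommonCrawl-style dump.

ROMANIAN_STOPWORDS = {"și", "este", "care", "pentru", "din", "sau", "sunt", "acest"}

def looks_romanian(text: str, threshold: float = 0.05) -> bool:
    """Return True if the share of Romanian stopwords exceeds the threshold."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w in ROMANIAN_STOPWORDS)
    return hits / len(words) >= threshold

docs = [
    "Acest text este scris pentru un corpus din limba română.",
    "This document is written entirely in English instead.",
]
kept = [d for d in docs if looks_romanian(d)]  # keeps only the first document
```

Production corpora typically use a trained language identifier rather than stopword heuristics, followed by quality and deduplication filters; this sketch stands in for only the first of those stages.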
Jul-18-2024