FuLG: 150B Romanian Corpus for Language Model Pretraining
Vlad-Andrei Bădoiu, Mihai-Valentin Dumitru, Alexandru M. Gherghescu, Alexandru Agache, Costin Raiciu
arXiv.org Artificial Intelligence
Research in the field of language models is rapidly evolving, with many open models being released to the public. Openly available pretraining corpora usually focus on only a handful of languages, with many others either missing completely or severely underrepresented. In this report, we introduce FuLG, a 150-billion-token Romanian corpus extracted from CommonCrawl. We present our methodology for filtering FuLG and compare it via ablation studies against existing Romanian corpora.
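The abstract names a filtering methodology but does not detail it here. As a hedged illustration only, and not the paper's actual pipeline, a minimal language-identification pass over crawled documents might look like the following stopword-ratio sketch (the stopword list and threshold are assumptions for demonstration):

```python
# Hypothetical sketch: a crude Romanian-language filter for web-crawl text.
# This does NOT reproduce the FuLG pipeline; it only illustrates the general
# idea of language-based document filtering over a CommonCrawl-style dump.

ROMANIAN_STOPWORDS = {"și", "este", "care", "pentru", "din", "sau", "sunt", "acest"}

def looks_romanian(text: str, threshold: float = 0.05) -> bool:
    """Return True if the share of Romanian stopwords exceeds the threshold."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w in ROMANIAN_STOPWORDS)
    return hits / len(words) >= threshold

docs = [
    "Acest text este scris pentru un corpus din limba română.",
    "This document is written entirely in English instead.",
]
kept = [d for d in docs if looks_romanian(d)]  # keeps only the first document
```

Production corpora typically use a trained language identifier rather than stopword heuristics, followed by quality and deduplication filters; this sketch stands in for only the first of those stages.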
Jul-18-2024