naab: A ready-to-use plug-and-play corpus for Farsi
Sadra Sabouri, Elnaz Rahmati, Soroush Gooran, Hossein Sameti
arXiv.org Artificial Intelligence
Large text corpora are essential for training deep models such as transformer-based ones. The need is especially acute for lower-resource languages such as Farsi. We propose naab, the biggest cleaned and ready-to-use open-source textual corpus in Farsi. It contains about 130GB of data, 250 million paragraphs, and 15 billion words. The project name is derived from the Farsi word ناب (naab), which means pure and high grade. We also provide the raw version of the corpus, called naab-raw, and an easy-to-use preprocessor for those who want to build a customized corpus.
Aug-29-2022
- Country:
  - Asia > Middle East
    - Iran > Tehran Province > Tehran (0.05)
    - Republic of Türkiye > Istanbul Province > Istanbul (0.04)
  - Europe
    - Germany > Saxony > Leipzig (0.05)
    - Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
  - North America > Dominican Republic (0.04)
- Genre:
  - Research Report (0.40)
- Industry:
  - Information Technology (0.47)
- Technology: