NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Cahyawijaya, Samuel, Lovenia, Holy, Aji, Alham Fikri, Winata, Genta Indra, Wilie, Bryan, Mahendra, Rahmad, Wibisono, Christian, Romadhony, Ade, Vincentio, Karissa, Koto, Fajri, Santoso, Jennifer, Moeljadi, David, Wirawan, Cahya, Hudi, Frederikus, Parmonangan, Ivan Halim, Alfina, Ika, Wicaksono, Muhammad Satrio, Putra, Ilham Firdausi, Rahmadani, Samsul, Oenang, Yulianti, Septiandri, Ali Akbar, Jaya, James, Dhole, Kaustubh D., Suryani, Arie Ardiyanti, Putri, Rifki Afina, Su, Dan, Stevens, Keith, Nityasya, Made Nindyatama, Adilazuarda, Muhammad Farid, Ignatius, Ryan, Diandaru, Ryandito, Yu, Tiezheng, Ghifari, Vito, Dai, Wenliang, Xu, Yan, Damapuspita, Dyah, Tho, Cuk, Karo, Ichwanul Muslim Karo, Fatyanosa, Tirana Noor, Ji, Ziwei, Fung, Pascale, Neubig, Graham, Baldwin, Timothy, Ruder, Sebastian, Sujaini, Herry, Sakti, Sakriani, Purwarianti, Ayu
arXiv.org Artificial Intelligence
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
Jul-21-2023
- Country:
- Africa
- Eritrea > Maekel
- Asmara (0.04)
- Madagascar (0.04)
- Middle East > Egypt
- Giza Governorate > Giza (0.04)
- Asia
- Brunei (0.04)
- Malaysia (0.04)
- Macao (0.04)
- Vietnam (0.04)
- Japan
- Honshū
- Chūgoku > Hiroshima Prefecture
- Hiroshima (0.04)
- Kantō
- Ibaraki Prefecture > Tsukuba (0.04)
- Tokyo Metropolis Prefecture > Tokyo (0.04)
- Kyūshū & Okinawa > Kyūshū
- Kumamoto Prefecture > Kumamoto (0.04)
- Indonesia
- Bali (0.04)
- Borneo > Kalimantan
- East Kalimantan > Nusantara (0.04)
- South Kalimantan (0.04)
- West Kalimantan > Pontianak (0.04)
- Java
- Nusa Tenggara Islands (0.04)
- Sulawesi
- South Sulawesi (0.04)
- Southeast Sulawesi (0.04)
- West Sulawesi (0.04)
- Sumatra
- Aceh (0.04)
- Bengkulu > Bengkulu (0.04)
- Jambi > Jambi (0.04)
- Lampung (0.14)
- North Sumatra (0.04)
- West Sumatra (0.04)
- West Nusa Tenggara (0.04)
- Middle East
- Israel (0.04)
- Jordan (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Philippines (0.04)
- Russia (0.04)
- China
- Guangdong Province > Shantou (0.04)
- Hong Kong (0.04)
- Sichuan Province (0.04)
- Timor-Leste (0.14)
- South Korea (0.04)
- Myanmar (0.04)
- Southeast Asia (0.04)
- Singapore (0.04)
- Europe
- Hungary
- Budapest > Budapest (0.04)
- Jász-Nagykun-Szolnok County > Szolnok (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Middle East > Republic of Türkiye
- Istanbul Province > Istanbul (0.04)
- Russia > Northwestern Federal District
- Leningrad Oblast > Saint Petersburg (0.04)
- Ukraine > Kyiv Oblast
- Kyiv (0.04)
- Switzerland (0.04)
- Netherlands
- North Holland > Amsterdam (0.04)
- South Holland > Dordrecht (0.04)
- Spain (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Italy > Tuscany
- Florence (0.04)
- Albania > Tirana County
- Tirana (0.04)
- Iceland > Capital Region
- Reykjavik (0.04)
- North America
- Canada (0.04)
- Dominican Republic (0.04)
- United States
- Iowa (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Texas > Dallas County
- Dallas (0.14)
- Washington > King County
- Seattle (0.04)
- Oceania
- Australia (0.04)
- Papua New Guinea (0.04)
- Genre:
- Research Report > New Finding (0.45)
- Industry:
- Education > Educational Setting (0.67)
- Government (0.67)
- Health & Medicine > Therapeutic Area (0.46)
- Information Technology > Services (0.67)
- Law (0.67)
- Media > News (0.45)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning
- Learning Graphical Models (0.92)
- Neural Networks > Deep Learning (0.92)
- Natural Language
- Grammars & Parsing (1.00)
- Information Extraction (0.93)
- Large Language Model (1.00)
- Machine Translation (1.00)
- Text Processing (1.00)
- Speech > Speech Recognition (1.00)