NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Cahyawijaya, Samuel, Lovenia, Holy, Aji, Alham Fikri, Winata, Genta Indra, Wilie, Bryan, Mahendra, Rahmad, Wibisono, Christian, Romadhony, Ade, Vincentio, Karissa, Koto, Fajri, Santoso, Jennifer, Moeljadi, David, Wirawan, Cahya, Hudi, Frederikus, Parmonangan, Ivan Halim, Alfina, Ika, Wicaksono, Muhammad Satrio, Putra, Ilham Firdausi, Rahmadani, Samsul, Oenang, Yulianti, Septiandri, Ali Akbar, Jaya, James, Dhole, Kaustubh D., Suryani, Arie Ardiyanti, Putri, Rifki Afina, Su, Dan, Stevens, Keith, Nityasya, Made Nindyatama, Adilazuarda, Muhammad Farid, Ignatius, Ryan, Diandaru, Ryandito, Yu, Tiezheng, Ghifari, Vito, Dai, Wenliang, Xu, Yan, Damapuspita, Dyah, Tho, Cuk, Karo, Ichwanul Muslim Karo, Fatyanosa, Tirana Noor, Ji, Ziwei, Fung, Pascale, Neubig, Graham, Baldwin, Timothy, Ruder, Sebastian, Sujaini, Herry, Sakti, Sakriani, Purwarianti, Ayu
–arXiv.org Artificial Intelligence
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. It also makes possible the first multilingual automatic speech recognition benchmark covering Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
Jul-21-2023
- Country:
- Oceania
- Papua New Guinea (0.04)
- Australia (0.04)
- North America
- Dominican Republic (0.04)
- Canada (0.04)
- United States
- Iowa (0.04)
- Washington > King County
- Seattle (0.04)
- Texas > Dallas County
- Dallas (0.14)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Europe
- Spain (0.04)
- Switzerland (0.04)
- Iceland > Capital Region
- Reykjavik (0.04)
- Albania > Tirana County
- Tirana (0.04)
- Italy > Tuscany
- Florence (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Netherlands
- South Holland > Dordrecht (0.04)
- North Holland > Amsterdam (0.04)
- Ukraine > Kyiv Oblast
- Kyiv (0.04)
- Russia > Northwestern Federal District
- Leningrad Oblast > Saint Petersburg (0.04)
- Middle East > Republic of Türkiye
- Istanbul Province > Istanbul (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Hungary
- Jász-Nagykun-Szolnok County > Szolnok (0.04)
- Budapest > Budapest (0.04)
- Asia
- Timor-Leste (0.14)
- Myanmar (0.04)
- Singapore (0.04)
- Southeast Asia (0.04)
- South Korea (0.04)
- Russia (0.04)
- Philippines (0.04)
- Vietnam (0.04)
- Macao (0.04)
- Malaysia (0.04)
- Brunei (0.04)
- China
- Hong Kong (0.04)
- Sichuan Province (0.04)
- Guangdong Province > Shantou (0.04)
- Middle East
- Jordan (0.04)
- Israel (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- Indonesia
- Bali (0.04)
- West Nusa Tenggara (0.04)
- Nusa Tenggara Islands (0.04)
- Sumatra
- Lampung (0.14)
- North Sumatra (0.04)
- West Sumatra (0.04)
- Jambi > Jambi (0.04)
- Bengkulu > Bengkulu (0.04)
- Aceh (0.04)
- Sulawesi
- West Sulawesi (0.04)
- Southeast Sulawesi (0.04)
- South Sulawesi (0.04)
- Java
- Borneo > Kalimantan
- East Kalimantan > Nusantara (0.04)
- West Kalimantan > Pontianak (0.04)
- South Kalimantan (0.04)
- Japan
- Kyūshū & Okinawa > Kyūshū
- Kumamoto Prefecture > Kumamoto (0.04)
- Honshū
- Kantō
- Tokyo Metropolis Prefecture > Tokyo (0.04)
- Ibaraki Prefecture > Tsukuba (0.04)
- Chūgoku > Hiroshima Prefecture
- Hiroshima (0.04)
- Africa
- Madagascar (0.04)
- Middle East > Egypt
- Giza Governorate > Giza (0.04)
- Eritrea > Maekel
- Asmara (0.04)
- Genre:
- Research Report > New Finding (0.45)
- Industry:
- Law (0.67)
- Government (0.67)
- Information Technology > Services (0.67)
- Education > Educational Setting (0.67)
- Health & Medicine > Therapeutic Area (0.46)
- Media > News (0.45)
- Technology:
- Information Technology > Artificial Intelligence
- Speech > Speech Recognition (1.00)
- Natural Language
- Text Processing (1.00)
- Machine Translation (1.00)
- Large Language Model (1.00)
- Grammars & Parsing (1.00)
- Information Extraction (0.93)
- Machine Learning
- Neural Networks > Deep Learning (0.92)
- Learning Graphical Models (0.92)