SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Lovenia, Holy, Mahendra, Rahmad, Akbar, Salsabil Maulana, Miranda, Lester James V., Santoso, Jennifer, Aco, Elyanah, Fadhilah, Akhdan, Mansurov, Jonibek, Imperial, Joseph Marvin, Kampman, Onno P., Moniz, Joel Ruben Antony, Habibi, Muhammad Ravi Shulthan, Hudi, Frederikus, Montalan, Railey, Ignatius, Ryan, Lopo, Joanito Agili, Nixon, William, Karlsson, Börje F., Jaya, James, Diandaru, Ryandito, Gao, Yuze, Amadeus, Patrick, Wang, Bin, Cruz, Jan Christian Blaise, Whitehouse, Chenxi, Parmonangan, Ivan Halim, Khelli, Maria, Zhang, Wenyu, Susanto, Lucky, Ryanda, Reynard Adha, Hermawan, Sonny Lazuardi, Velasco, Dan John, Kautsar, Muhammad Dehan Al, Hendria, Willy Fitra, Moslem, Yasmin, Flynn, Noah, Adilazuarda, Muhammad Farid, Li, Haochen, Lee, Johanes, Damanhuri, R., Sun, Shuo, Qorib, Muhammad Reza, Djanibekov, Amirbek, Leong, Wei Qi, Do, Quyet V., Muennighoff, Niklas, Pansuwan, Tanrada, Putra, Ilham Firdausi, Xu, Yan, Tai, Ngee Chia, Purwarianti, Ayu, Ruder, Sebastian, Tjhi, William, Limkonchotiwat, Peerat, Aji, Alham Fikri, Keh, Sedrick, Winata, Genta Indra, Zhang, Ruochen, Koto, Fajri, Yong, Zheng-Xin, Cahyawijaya, Samuel
arXiv.org Artificial Intelligence
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of text, image, and audio datasets from SEA, which compromises their quality for SEA languages. Evaluating models for SEA languages is challenging because high-quality datasets are scarce, a problem compounded by the dominance of English training data, which raises concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub, filling the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.
Jul-8-2024
- Country:
- South America > Brazil (0.04)
- Oceania
- Australia > Queensland (0.04)
- Fiji > Eastern Division
- Levuka (0.04)
- North America
- Dominican Republic (0.04)
- United States
- Washington > King County
- Seattle (0.04)
- Texas > Dallas County
- Dallas (0.04)
- Canada
- Europe
- Monaco (0.04)
- Sweden (0.04)
- Albania > Tirana County
- Tirana (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Germany > Saxony
- Leipzig (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Middle East
- Malta > Eastern Region
- Northern Harbour District > St. Julian's (0.04)
- Cyprus > Nicosia
- Nicosia (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Netherlands > South Holland
- Leiden (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Asia
- Southeast Asia (0.24)
- Laos (0.06)
- Cambodia (0.05)
- Timor-Leste (0.05)
- Singapore (0.05)
- East Asia (0.04)
- India (0.04)
- Uzbekistan (0.04)
- Malaysia > Penang (0.04)
- Thailand
- China
- Myanmar > Chin State
- Hakha (0.04)
- Philippines
- Middle East
- Israel (0.04)
- Jordan (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Indonesia
- Bali (0.04)
- East Nusa Tenggara > Kupang (0.04)
- Sumatra
- West Sumatra > Padang (0.04)
- Aceh (0.04)
- Sulawesi
- West Sulawesi > Mamuju (0.04)
- South Sulawesi > Makassar (0.04)
- North Sulawesi > Manado (0.04)
- Gorontalo > Gorontalo (0.04)
- Vietnam
- Hanoi > Hanoi (0.04)
- Hồ Chí Minh City > Hồ Chí Minh City (0.04)
- Haiphong > Haiphong (0.04)
- Brunei
- Japan > Honshū
- Tōhoku (0.04)
- Kantō > Tokyo Metropolis Prefecture
- Tokyo (0.14)
- Africa > Middle East
- Genre:
- Research Report (0.81)
- Industry:
- Education (0.68)
- Information Technology (0.67)
- Energy (0.45)
- Technology:
- Information Technology
- Communications > Social Media (1.00)
- Artificial Intelligence
- Vision (1.00)
- Representation & Reasoning (1.00)
- Speech > Speech Recognition (0.92)
- Natural Language
- Text Processing (1.00)
- Machine Translation (1.00)
- Large Language Model (1.00)
- Chatbot (0.69)
- Machine Learning > Neural Networks
- Deep Learning (1.00)