SEA-BED: Southeast Asia Embedding Benchmark
Ponwitayarat, Wuttikorn; Ng, Raymond; Montalan, Jann Railey; Aung, Thura; Ngui, Jian Gang; Susanto, Yosephine; Tjhi, William; Tasawong, Panuthep; Cambria, Erik; Chuangsuwanich, Ekapol; Nutanong, Sarana; Limkonchotiwat, Peerat
arXiv.org Artificial Intelligence
Sentence embeddings are essential for NLP tasks such as semantic search, re-ranking, and textual similarity. Although multilingual benchmarks like MMTEB broaden coverage, Southeast Asian (SEA) datasets remain scarce and are often machine-translated, so they miss native linguistic properties. Despite having nearly 700 million speakers, the SEA region lacks a region-specific embedding benchmark. We introduce SEA-BED, the first large-scale SEA embedding benchmark, with 169 datasets across 9 tasks and 10 languages, of which 71% are formulated by humans rather than by machine generation or translation. We address three research questions: (1) which SEA languages and tasks are challenging, (2) whether SEA languages show unique performance gaps globally, and (3) how human vs. machine translations affect evaluation. We evaluate 17 embedding models across six studies, analyzing task and language challenges, cross-benchmark comparisons, and translation trade-offs. Results show sharp ranking shifts, inconsistent model performance among SEA languages, and the importance of human-curated datasets for low-resource languages like Burmese.
Aug-26-2025
- Country:
- Africa > Eritrea
- Asia
- Malaysia (0.04)
- East Asia (0.04)
- Singapore > Central Region
- Singapore (0.04)
- Japan
- Honshū
- Kansai > Osaka Prefecture
- Osaka (0.04)
- Kantō > Tokyo Metropolis Prefecture
- Tokyo (0.14)
- Kyūshū & Okinawa > Kyūshū
- Miyazaki Prefecture > Miyazaki (0.04)
- Middle East
- Jordan (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.14)
- Southeast Asia (0.61)
- Thailand > Bangkok
- Bangkok (0.04)
- Vietnam > Hanoi
- Hanoi (0.04)
- Myanmar > Tanintharyi Region
- Dawei (0.04)
- Indonesia > Java
- Europe
- Albania > Tirana County
- Tirana (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Germany > Bavaria
- Upper Bavaria > Ingolstadt (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Italy
- Calabria > Catanzaro Province
- Catanzaro (0.04)
- Tuscany > Florence (0.04)
- North America
- Canada > British Columbia
- Vancouver (0.04)
- Dominican Republic (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- United States
- Florida > Miami-Dade County
- Miami (0.14)
- Oregon > Multnomah County
- Portland (0.04)
- Washington > King County
- Seattle (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Education (0.67)
- Health & Medicine (1.00)
- Technology: