Timor-Leste
Global health's defining test
As we look back on 2025, the world experienced a year of both remarkable achievement and profound challenge in global health. Multilateralism, science and solidarity were tested as never before, underscoring a fundamental truth: International cooperation is not optional. It is essential if we are to protect and promote health for everyone, everywhere in 2026 and beyond. Perhaps the most significant milestone was the adoption by WHO Member States of the Pandemic Agreement, a landmark step towards making the world safer from future pandemics. Alongside this, amendments to the International Health Regulations came into force, including a new "pandemic emergency" alert level designed to trigger stronger global cooperation.
Unlocking the Potential of Global Human Expertise
For example, in the Pandemic Response Challenge experiment, the context consisted of data about the geographic region for which the predictions were made, e.g., historical data of COVID-19 cases and intervention policies; actions were future schedules of intervention policies for the region; and outcomes were predicted future cases of COVID-19 along with the stringency
Hierarchical Memory Organization for Wikipedia Generation
Yu, Eugene J., Zhu, Dawei, Song, Yifan, Wong, Xiangyu, Zhang, Jiebin, Shi, Wenxuan, Li, Xiaoguang, Liu, Qun, Li, Sujian
Generating Wikipedia articles autonomously is a challenging task requiring the integration of accurate, comprehensive, and well-structured information from diverse sources. This paper introduces the Memory Organization-based Generation (MOG) framework, a novel approach to address these challenges by leveraging a hierarchical memory architecture. MOG extracts fine-grained memory units from web documents, recursively organizes them into a Wikipedia-style hierarchical structure, and uses this structure to guide the generation process. This ensures alignment between memory and the article outline, improving both informativeness and verifiability while minimizing hallucinations. Additionally, a citation module is implemented to enhance traceability by linking every generated sentence to specific memory units. Evaluations on our newly created WikiStart dataset demonstrate that MOG outperforms baseline methods in producing informative and reliable articles, making it particularly robust in real-world scenarios.
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Cahyawijaya, Samuel, Lovenia, Holy, Moniz, Joel Ruben Antony, Wong, Tack Hwa, Farhansyah, Mohammad Rifqi, Maung, Thant Thiri, Hudi, Frederikus, Anugraha, David, Habibi, Muhammad Ravi Shulthan, Qorib, Muhammad Reza, Agarwal, Amit, Imperial, Joseph Marvin, Patel, Hitesh Laxmichand, Feliren, Vicky, Nasution, Bahrul Ilmi, Rufino, Manuel Antonio, Winata, Genta Indra, Rajagede, Rian Adam, Catalan, Carlos Rafael, Imam, Mohamed Fazli, Pattnayak, Priyaranjan, Pranida, Salsabila Zahirah, Pratama, Kevin, Bangera, Yeshil, Na-Thalang, Adisai, Monderin, Patricia Nicole, Song, Yueqi, Simon, Christian, Ng, Lynnette Hui Xian, Sapan, Richardy Lobo', Rafi, Taki Hasan, Wang, Bin, Supryadi, null, Veerakanjana, Kanyakorn, Ittichaiwong, Piyalitt, Roque, Matthew Theodore, Vincentio, Karissa, Kreangphet, Takdanai, Artkaew, Phakphum, Palgunadi, Kadek Hendrawan, Yu, Yanzhi, Hastuti, Rochana Prih, Nixon, William, Bangera, Mithil, Lim, Adrian Xuan Wei, Khine, Aye Hninn, Zhafran, Hanif Muhammad, Ferdinan, Teddy, Izzani, Audra Aurora, Singh, Ayushman, Evan, null, Krito, Jauza Akbar, Anugraha, Michael, Ilasariya, Fenal Ashokbhai, Li, Haochen, Daniswara, John Amadeo, Tjiaranata, Filbert Aurelian, Yulianrifat, Eryawan Presma, Udomcharoenchaikit, Can, Ansori, Fadil Risdian, Ihsani, Mahardika Krisna, Nguyen, Giang, Barik, Anab Maulana, Velasco, Dan John, Genadi, Rifo Ahmad, Saha, Saptarshi, Wei, Chengwei, Flores, Isaiah, Chen, Kenneth Ko Han, Santos, Anjela Gail, Lim, Wan Shen, Phyo, Kaung Si, Santos, Tim, Dwiastuti, Meisyarah, Luo, Jiayun, Cruz, Jan Christian Blaise, Hee, Ming Shan, Hanif, Ikhlasul Akmal, Hakim, M. Alif Al, Sya'ban, Muhammad Rizky, Kerdthaisong, Kun, Miranda, Lester James V., Koto, Fajri, Fatyanosa, Tirana Noor, Aji, Alham Fikri, Rosal, Jostin Jerico, Kevin, Jun, Wijaya, Robert, Kampman, Onno P., Zhang, Ruochen, Karlsson, Börje F., Limkonchotiwat, Peerat
Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further in the exploration of the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately ~85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M SEA culturally-relevant images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.
SEA-HELM: Southeast Asian Holistic Evaluation of Language Models
Susanto, Yosephine, Hulagadri, Adithya Venkatadri, Montalan, Jann Railey, Ngui, Jian Gang, Yong, Xian Bin, Leong, Weiqi, Rengarajan, Hamsawardhini, Limkonchotiwat, Peerat, Mai, Yifan, Tjhi, William Chandra
With the rapid emergence of novel capabilities in Large Language Models (LLMs), the need for rigorous multilingual and multicultural benchmarks that are integrated has become more pronounced. Though existing LLM benchmarks are capable of evaluating specific capabilities of LLMs in English as well as in various mid- to low-resource languages, including those in the Southeast Asian (SEA) region, a comprehensive and authentic evaluation suite for the SEA languages has not been developed thus far. Here, we present SEA-HELM, a holistic linguistic and cultural LLM evaluation suite that emphasizes SEA languages, comprising five core pillars: (1) NLP Classics, (2) LLM-specifics, (3) SEA Linguistics, (4) SEA Culture, (5) Safety. SEA-HELM currently supports Filipino, Indonesian, Tamil, Thai, and Vietnamese. We also introduce the SEA-HELM leaderboard, which allows users to understand models' multilingual and multicultural performance in a systematic and user-friendly manner.
GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking
Schneider, Florian, Holtermann, Carolin, Biemann, Chris, Lauscher, Anne
Large Vision-Language Models (LVLMs) have recently gained attention due to their distinctive performance and broad applicability. While it has been previously shown that their efficacy in usage scenarios involving non-Western contexts falls short, existing studies are limited in scope, covering just a narrow range of cultures, focusing exclusively on a small number of cultural aspects, or evaluating a limited selection of models on a single task only. Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive multimodal benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries representing six global macro-regions. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary and 26 open-weight models of all sizes. We systematically examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues. Our analyses reveal strong biases toward Western cultures across models and tasks and highlight strong correlations between model size and performance, as well as the effectiveness of multimodal input and external geographic cues. We further find that models have more knowledge of tangible than intangible aspects (e.g., food vs. rituals) and that they excel in recognizing broad cultural origins but struggle with a more nuanced understanding.
Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation
Xu, Derong, Li, Xinhang, Zhang, Ziheng, Lin, Zhenxi, Zhu, Zhihong, Zheng, Zhi, Wu, Xian, Zhao, Xiangyu, Xu, Tong, Chen, Enhong
Large Language Models (LLMs) demonstrate remarkable capabilities, yet struggle with hallucination and outdated knowledge when tasked with complex knowledge reasoning, resulting in factually incorrect outputs. Previous studies have attempted to mitigate it by retrieving factual knowledge from large-scale knowledge graphs (KGs) to assist LLMs in logical reasoning and prediction of answers. However, this kind of approach often introduces noise and irrelevant data, especially in situations with extensive context from multiple knowledge aspects. In this way, LLM attention can be potentially mislead from question and relevant information. In our study, we introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework. This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings. The Amar framework comprises two key sub-components: 1) a self-alignment module that aligns commonalities among entities, relations, and subgraphs to enhance retrieved text, thereby reducing noise interference; 2) a relevance gating module that employs a soft gate to learn the relevance score between question and multi-aspect retrieved data, to determine which information should be used to enhance LLMs' output, or even filtered altogether. Our method has achieved state-of-the-art performance on two common datasets, WebQSP and CWQ, showing a 1.9\% improvement in accuracy over its best competitor and a 6.6\% improvement in logical form generation over a method that directly uses retrieved text as context prompts. These results demonstrate the effectiveness of Amar in improving the reasoning of LLMs.
CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries
Liu, Shudong, Jin, Yiqiao, Li, Cheng, Wong, Derek F., Wen, Qingsong, Sun, Lichao, Chen, Haipeng, Xie, Xing, Wang, Jindong
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding, often misinterpreting symbols, gestures, and artifacts due to biases in predominantly Western-centric training data. In this paper, we construct CultureVerse, a large-scale multimodal benchmark covering 19, 682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types, with the aim of characterizing and improving VLMs' multicultural understanding capabilities. Then, we propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding. Our evaluation of 16 models reveals significant disparities, with a stronger performance in Western concepts and weaker results in African and Asian contexts. Fine-tuning on our CultureVerse enhances cultural perception, demonstrating cross-cultural, cross-continent, and cross-dataset generalization without sacrificing performance on models' general VLM benchmarks. We further present insights on cultural generalization and forgetting. We hope that this work could lay the foundation for more equitable and culturally aware multimodal AI systems.
Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service
Merx, Raphael, Correia, Adérito José Guterres, Suominen, Hanna, Vylomova, Ekaterina
Low-resource machine translation (MT) presents a diversity of community needs and application challenges that remain poorly understood. To complement surveys and focus groups, which tend to rely on small samples of respondents, we propose an observational study on actual usage patterns of a specialized MT service for the Tetun language, which is the lingua franca in Timor-Leste. Our analysis of 100,000 translation requests reveals patterns that challenge assumptions based on existing corpora. We find that users, many of them students on mobile devices, typically translate text from a high-resource language into Tetun across diverse domains including science, healthcare, and daily life. This contrasts sharply with available Tetun corpora, which are dominated by news articles covering government and social issues. Our results suggest that MT systems for minority languages like Tetun should prioritize accuracy on domains relevant to educational contexts, in the high-resource to low-resource direction. More broadly, this study demonstrates how observational analysis can inform low-resource language technology development, by grounding research in practical community needs.