Bangsamoro
FilBench: Can LLMs Understand and Generate Filipino?
Miranda, Lester James V., Aco, Elyanah, Manuel, Conner, Cruz, Jan Christian Blaise, Imperial, Joseph Marvin
Despite the impressive performance of LLMs on English-based tasks, little is known about their capabilities in specific languages such as Filipino. In this work, we address this gap by introducing FilBench, a Filipino-centric benchmark designed to evaluate LLMs across a diverse set of tasks and capabilities in Filipino, Tagalog, and Cebuano. We carefully curate the tasks in FilBench to reflect the priorities and trends of NLP research in the Philippines, such as Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation. By evaluating 27 state-of-the-art LLMs on FilBench, we find that several LLMs struggle with reading comprehension and translation. Our results indicate that FilBench is challenging, with the best model, GPT-4o, achieving a score of only 72.23%. We also find that models trained specifically for Southeast Asian languages tend to underperform on FilBench, with the highest-performing such model, SEA-LION v3 70B, achieving a score of only 61.07%. Our work demonstrates the value of curating language-specific LLM benchmarks to drive progress on Filipino NLP and increase the inclusion of Philippine languages in LLM development.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > Serbia > Central Serbia > Belgrade (0.04)
- Asia > Southeast Asia (0.04)
- (16 more...)
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Lovenia, Holy, Mahendra, Rahmad, Akbar, Salsabil Maulana, Miranda, Lester James V., Santoso, Jennifer, Aco, Elyanah, Fadhilah, Akhdan, Mansurov, Jonibek, Imperial, Joseph Marvin, Kampman, Onno P., Moniz, Joel Ruben Antony, Habibi, Muhammad Ravi Shulthan, Hudi, Frederikus, Montalan, Railey, Ignatius, Ryan, Lopo, Joanito Agili, Nixon, William, Karlsson, Börje F., Jaya, James, Diandaru, Ryandito, Gao, Yuze, Amadeus, Patrick, Wang, Bin, Cruz, Jan Christian Blaise, Whitehouse, Chenxi, Parmonangan, Ivan Halim, Khelli, Maria, Zhang, Wenyu, Susanto, Lucky, Ryanda, Reynard Adha, Hermawan, Sonny Lazuardi, Velasco, Dan John, Kautsar, Muhammad Dehan Al, Hendria, Willy Fitra, Moslem, Yasmin, Flynn, Noah, Adilazuarda, Muhammad Farid, Li, Haochen, Lee, Johanes, Damanhuri, R., Sun, Shuo, Qorib, Muhammad Reza, Djanibekov, Amirbek, Leong, Wei Qi, Do, Quyet V., Muennighoff, Niklas, Pansuwan, Tanrada, Putra, Ilham Firdausi, Xu, Yan, Tai, Ngee Chia, Purwarianti, Ayu, Ruder, Sebastian, Tjhi, William, Limkonchotiwat, Peerat, Aji, Alham Fikri, Keh, Sedrick, Winata, Genta Indra, Zhang, Ruochen, Koto, Fajri, Yong, Zheng-Xin, Cahyawijaya, Samuel
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub, filling the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.
- Asia > Southeast Asia (0.24)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Asia > Laos (0.06)
- (59 more...)
- Education (0.68)
- Information Technology (0.67)
- Energy (0.45)
A quantitative and typological study of Early Slavic participle clauses and their competition
This thesis is a corpus-based, quantitative, and typological analysis of the functions of Early Slavic participle constructions and their finite competitors ($jegda$-'when'-clauses). The first part leverages detailed linguistic annotation on Early Slavic corpora at the morphosyntactic, dependency, information-structural, and lexical levels to obtain indirect evidence for the different potential functions of participle clauses and their main finite competitor, and to understand the roles of compositionality and default discourse reasoning as explanations for the distribution of participle constructions and $jegda$-clauses in the corpus. The second part uses massively parallel data to analyze typological variation in how languages express the semantic space of English $when$, whose scope encompasses that of Early Slavic participle constructions and $jegda$-clauses. Probabilistic semantic maps are generated, and statistical methods (including Kriging, Gaussian Mixture Modelling, and precision-and-recall analysis) are used to induce cross-linguistically salient dimensions from the parallel corpus and to study conceptual variation within the semantic space of the hypothetical concept WHEN.
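To make the statistical apparatus concrete, here is a minimal, purely illustrative Python sketch of one of the listed techniques, Gaussian Mixture Modelling, applied to hypothetical 2-D coordinates on a probabilistic semantic map; the data, component count, and setup are assumptions for illustration, not the thesis's actual pipeline.

```python
# Minimal sketch: clustering points on a 2-D semantic map with a Gaussian mixture.
# The coordinates and the number of components are illustrative assumptions,
# not the thesis's actual data or settings.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical 2-D map coordinates for contexts translated by 'when'.
coords = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2)),   # e.g. temporal-simultaneous uses
    rng.normal(loc=[2.0, 1.0], scale=0.3, size=(100, 2)),   # e.g. conditional-like uses
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(coords)
labels = gmm.predict(coords)
print("Cluster means:", gmm.means_)
print("First ten labels:", labels[:10])
```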
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.27)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.13)
- Europe > Ukraine > Kyiv Oblast > Kyiv (0.13)
- (75 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Media (0.92)
- Leisure & Entertainment (0.67)
Calibrating Long-form Generations from Large Language Models
Huang, Yukun, Liu, Yixin, Thirukovalluru, Raghuveer, Cohan, Arman, Dhingra, Bhuwan
To enhance Large Language Models' (LLMs) reliability, calibration is essential -- the model's assessed confidence scores should align with the actual likelihood of its responses being correct. However, current confidence elicitation methods and calibration metrics typically rely on a binary true/false assessment of response correctness. This approach does not apply to long-form generation, where an answer can be partially correct. Addressing this gap, we introduce a unified calibration framework in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores. Within this framework, we develop three metrics to precisely evaluate LLM calibration and further propose two confidence elicitation methods based on self-consistency and self-evaluation. Our experiments, which include long-form QA and summarization tasks, demonstrate that larger models do not necessarily guarantee better calibration, that calibration performance is metric-dependent, and that self-consistency methods excel on factoid datasets. We also find that calibration can be enhanced through techniques such as fine-tuning, integrating relevant source documents, scaling the temperature, and combining self-consistency with self-evaluation. Lastly, we showcase a practical application of our system: selecting and cascading open-source models and ChatGPT to optimize correctness given a limited API budget. This research not only challenges existing notions of LLM calibration but also offers practical methodologies for improving trustworthiness in long-form generation.
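To illustrate the idea of treating correctness and confidence as distributions over scores rather than binary labels, the sketch below compares two hypothetical discrete distributions over shared score bins using an expected-score gap and a one-dimensional earth mover's distance; these are illustrative quantities and not necessarily the three metrics defined in the paper.

```python
# Minimal sketch: compare a correctness distribution and a confidence distribution
# over shared score bins (0.0 ... 1.0). These are illustrative calibration signals,
# not necessarily the metrics proposed in the paper; the numbers are invented.
import numpy as np

bins = np.linspace(0.0, 1.0, 5)                      # score levels: 0, 0.25, 0.5, 0.75, 1.0
correctness = np.array([0.1, 0.2, 0.3, 0.3, 0.1])    # hypothetical judged-correctness distribution
confidence  = np.array([0.0, 0.1, 0.2, 0.4, 0.3])    # hypothetical elicited-confidence distribution

# Gap between expected scores (a crude calibration signal).
expectation_gap = abs(np.dot(bins, confidence) - np.dot(bins, correctness))

# 1-D earth mover's distance between the two distributions via CDF differences.
emd = np.sum(np.abs(np.cumsum(confidence) - np.cumsum(correctness))) * (bins[1] - bins[0])

print(f"expected-score gap: {expectation_gap:.3f}")
print(f"earth mover's distance: {emd:.3f}")
```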
- Asia > Singapore (0.04)
- Asia > Philippines > Luzon > Central Luzon > Province of Tarlac > City of Tarlac (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (7 more...)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.68)
- Media > Music (0.67)
- Leisure & Entertainment (0.67)
GlotLID: Language Identification for Low-Resource Languages
Kargaran, Amir Hossein, Imani, Ayyoob, Yvon, François, Schütze, Hinrich
Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable, and (iii) is efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability, and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID, and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguages vs. varieties, and generally noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. The GlotLID-M model, code, and list of data sources are available at https://github.com/cisnlp/GlotLID.
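For readers who want to try the model, here is a minimal usage sketch that assumes GlotLID-M is distributed as a fastText model downloadable from the Hugging Face Hub; the repository ID and filename are assumptions, so consult the linked GitHub repository for the authoritative download and usage instructions.

```python
# Minimal sketch: running a fastText-style LID model such as GlotLID-M.
# The repo_id and filename below are assumptions; see https://github.com/cisnlp/GlotLID
# for the authoritative instructions.
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")  # assumed location
model = fasttext.load_model(model_path)

texts = [
    "Kumusta ka? Salamat sa iyong tulong.",   # Tagalog/Filipino
    "Bonjour, comment allez-vous ?",          # French
]
for text in texts:
    labels, probs = model.predict(text, k=3)  # top-3 language labels with probabilities
    print(text, "->", list(zip(labels, probs.round(3))))
```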
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- South America > Peru > Huánuco Department > Huánuco Province > Huánuco (0.04)
- North America > Mexico > Puebla (0.04)
- (84 more...)
- Media > Television (0.45)
- Health & Medicine > Therapeutic Area > Neurology (0.33)
MegaWika: Millions of reports and their sources across 50 diverse languages
Barham, Samuel, Weller, Orion, Yuan, Michelle, Murray, Kenton, Yarmohammadi, Mahsa, Jiang, Zhengping, Vashishtha, Siddharth, Martin, Alexander, Liu, Anqi, White, Aaron Steven, Boyd-Graber, Jordan, Van Durme, Benjamin
To foster the development of new models for collaborative AI-assisted report generation, we introduce MegaWika, consisting of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications: beyond the initial Wikipedia citation extraction and web scraping of content, we translate non-English articles for cross-lingual applications and provide FrameNet parses for automated semantic analysis. MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual. We manually analyze the quality of this resource through a semantically stratified sample. Finally, we provide baseline results and trained models for crucial steps in automated report generation: cross-lingual question answering and citation retrieval.
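As a rough illustration of how one might explore the corpus, the sketch below streams records with the Hugging Face `datasets` library; the dataset identifier and the need (or not) for a language configuration are assumptions, so check the MegaWika release for the actual loading instructions and schema.

```python
# Minimal sketch: streaming a large multilingual report-generation corpus with the
# Hugging Face `datasets` library. The dataset identifier is an assumption based on
# the paper's release; a language configuration name may also be required.
from datasets import load_dataset

dataset = load_dataset("hltcoe/megawika", split="train", streaming=True)  # assumed repo id

# Inspect a handful of records instead of assuming the schema up front.
for i, article in enumerate(dataset):
    print(sorted(article.keys()))
    if i >= 2:
        break
```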
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Philippines > Mindanao > Bangsamoro > Province of Maguindanao del Norte > City of Cotabato (0.05)
- Africa > Togo > Maritime Region > Lome (0.04)
- (8 more...)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.68)
Building Machine Translation Systems for the Next Thousand Languages
Bapna, Ankur, Caswell, Isaac, Kreutzer, Julia, Firat, Orhan, van Esch, Daan, Siddhant, Aditya, Niu, Mengmeng, Baljekar, Pallavi, Garcia, Xavier, Macherey, Wolfgang, Breiner, Theresa, Axelrod, Vera, Riesa, Jason, Cao, Yuan, Chen, Mia Xu, Macherey, Klaus, Krikun, Maxim, Wang, Pidong, Gutkin, Alexander, Shah, Apurva, Huang, Yanping, Chen, Zhifeng, Wu, Yonghui, Hughes, Macduff
In this paper, we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) Developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages and monolingual datasets for an additional 1000+ languages; and (iii) Studying the limitations of evaluation metrics for these languages and conducting qualitative analysis of the outputs from our MT models, highlighting several frequent error modes of these types of models. We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.13)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- North America > Mexico > Puebla (0.04)
- (68 more...)
- Media (0.67)
- Health & Medicine (0.67)
- Education (0.46)
- Leisure & Entertainment (0.45)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Beheaded in Philadelphia, punched in Silicon Valley and smeared with barbecue sauce in San Francisco: Why do humans hurt robots?
A hitchhiking robot was beheaded in Philadelphia. A security robot was punched to the ground in Silicon Valley. Another security bot, in San Francisco, was covered in a tarp and smeared with barbecue sauce. Why do people lash out at robots, particularly those built to resemble humans? It is a global phenomenon. In a mall in Osaka, Japan, three boys beat a humanoid robot with all their strength. In Moscow, a man attacked a teaching robot named Alantim with a baseball bat, kicking it to the ground, while the robot pleaded for help.
- North America > United States > California > San Francisco County > San Francisco (0.60)
- Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.60)
- Europe > Germany (0.30)
- (47 more...)
- Law (1.00)
- Information Technology (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- (8 more...)
Militants in southern Philippines free Norwegian hostage
Abu Sayyaf extremists on Saturday freed a Norwegian man kidnapped a year ago in the southern Philippines along with two Canadians, who were later beheaded, and a Filipino woman, who had earlier been released by the ransom-seeking militants, officials said. Kjartan Sekkingstad was freed in Patikul town in Sulu province and was eventually secured by rebels from the larger Moro National Liberation Front, which has signed a peace deal with the government and helped negotiate his release, Philippine government officials said. Sekkingstad, held in jungle captivity since being kidnapped last September, was to stay overnight at the house of Moro National Liberation Front chairman Nur Misuari in Sulu and then be flown to the southern city of Davao on Sunday to meet with Philippine President Rodrigo Duterte, said Jesus Dureza, who advises Duterte on peace talks with insurgent groups. A plan to fly the freed hostage out of Sulu, a jungle-clad Muslim region about 590 miles south of Manila, on Saturday was scrapped because of bad weather, Dureza said. Dureza said that when he spoke on the phone with Sekkingstad, the Norwegian expressed his gratitude to Duterte.
- Asia > Philippines > Mindanao > Bangsamoro > Province of Sulu (0.25)
- Asia > Philippines > Luzon > National Capital Region > City of Manila (0.25)
- North America > Canada (0.16)
- (2 more...)