ASR Under Noise: Exploring Robustness for Sundanese and Javanese
Pranida, Salsabila Zahirah, Airlangga, Muhammad Cendekia, Genadi, Rifo Ahmad, Shehata, Shady
We investigate the robustness of Whisper-based automatic speech recognition (ASR) models for two major Indonesian regional languages: Javanese and Sundanese. While recent work has demonstrated strong ASR performance under clean conditions, these models' effectiveness in noisy environments remains unclear. To address this, we experiment with multiple training strategies, including synthetic noise augmentation and SpecAugment, and evaluate performance across a range of signal-to-noise ratios (SNRs). Our results show that noise-aware training substantially improves robustness, particularly for larger Whisper models. A detailed error analysis further reveals language-specific challenges, highlighting avenues for future improvements.
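The noise-aware training described here hinges on mixing noise into clean speech at a controlled SNR. Below is a minimal sketch of that mixing step, assuming NumPy waveforms; the function name and the example noise source are illustrative, not the paper's exact pipeline.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise signal into a speech signal at a target SNR (in dB)."""
    # Loop or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10

    # Scale noise so that 10*log10(speech_power / noise_power) == snr_db.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + noise

# Example: corrupt an utterance at 5 dB SNR before feeding it to the model.
# noisy = mix_at_snr(clean_waveform, babble_noise, snr_db=5.0)
```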
LoraxBench: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages
Aji, Alham Fikri, Cohn, Trevor
As one of the world's most populous countries, with 700 languages spoken, Indonesia lags behind in NLP progress. We introduce LoraxBench, a benchmark that focuses on low-resource languages of Indonesia and covers 6 diverse tasks: reading comprehension, open-domain QA, language inference, causal reasoning, translation, and cultural QA. Our dataset covers 20 languages, with the addition of two formality registers for three of them. We evaluate a diverse set of multilingual and region-focused LLMs and find the benchmark challenging. We note a visible discrepancy between performance in Indonesian and the other languages, especially the low-resource ones. There is no clear lead when using a region-specific model as opposed to a general multilingual model. Lastly, we show that a change in register affects model performance, especially for registers not commonly found in social media, such as the high-politeness 'Krama' register of Javanese.
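The per-language discrepancy reported here is essentially a gap between each language's average task score and the Indonesian score. A minimal sketch of that aggregation follows; the `results` entries and language codes are hypothetical placeholders, not LoraxBench numbers.

```python
from collections import defaultdict

results = {
    # (language, task) -> accuracy (hypothetical values)
    ("ind", "reading_comprehension"): 0.78,
    ("jav", "reading_comprehension"): 0.61,
    ("ind", "causal_reasoning"): 0.70,
    ("jav", "causal_reasoning"): 0.55,
}

per_lang = defaultdict(list)
for (lang, _task), score in results.items():
    per_lang[lang].append(score)

avg = {lang: sum(s) / len(s) for lang, s in per_lang.items()}
for lang, score in sorted(avg.items()):
    gap = avg["ind"] - score  # discrepancy relative to Indonesian
    print(f"{lang}: avg={score:.3f}  gap_to_ind={gap:+.3f}")
```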
Adapting Language Models to Indonesian Local Languages: An Empirical Study of Language Transferability on Zero-Shot Settings
In this paper, we investigate the transferability of pre-trained language models to low-resource Indonesian local languages through the task of sentiment analysis. We evaluate both zero-shot performance and adapter-based transfer on ten local languages using models of different types: a monolingual Indonesian BERT, multilingual models such as mBERT and XLM-R, and a modular adapter-based approach called MAD-X. To better understand model behavior, we group the target languages into three categories: seen (included during pre-training), partially seen (not included but linguistically related to seen languages), and unseen (absent and unrelated in pre-training data). Our results reveal clear performance disparities across these groups: multilingual models perform best on seen languages, moderately on partially seen ones, and poorly on unseen languages. We find that MAD-X significantly improves performance, especially for seen and partially seen languages, without requiring labeled data in the target language. Additionally, we conduct a further analysis on tokenization and show that while subword fragmentation and vocabulary overlap with Indonesian correlate weakly with prediction quality, they do not fully explain the observed performance. Instead, the most consistent predictor of transfer success is the model's prior exposure to the language, either directly or through a related language.
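The tokenization analysis mentioned here rests on two simple measurements. A minimal sketch of both follows, assuming a Hugging Face tokenizer; the model name and sample texts are illustrative assumptions, not the paper's exact setup.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def fragmentation(text: str) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    words = text.split()
    pieces = tok.tokenize(text)
    return len(pieces) / max(len(words), 1)

def vocab_overlap(target_text: str, indonesian_text: str) -> float:
    """Fraction of the target language's subword types shared with Indonesian."""
    target_vocab = set(tok.tokenize(target_text))
    id_vocab = set(tok.tokenize(indonesian_text))
    return len(target_vocab & id_vocab) / max(len(target_vocab), 1)

print(fragmentation("conto kalimah dina basa daerah"))
```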
Extracting and Emulsifying Cultural Explanation to Improve Multilingual Capability of LLMs
Large Language Models (LLMs) have achieved remarkable success, but their English-centric training data limits performance in non-English languages, highlighting the need for enhancements in their multilingual capabilities. While some work on multilingual prompting methods handles non-English queries by utilizing English translations or restructuring them to more closely align with LLM reasoning patterns, these works often overlook the importance of cultural context, limiting their effectiveness. To address this limitation, we propose EMCEI, a simple yet effective approach that improves LLMs' multilingual capabilities by incorporating cultural context for more accurate and appropriate responses. Specifically, EMCEI follows a two-step process that first extracts relevant cultural context from the LLM's parametric knowledge via prompting. Then, EMCEI employs an LLM-as-Judge mechanism to select the most appropriate response by balancing cultural relevance and reasoning ability. Experiments on diverse multilingual benchmarks show that EMCEI outperforms existing baselines, demonstrating its effectiveness in handling multilingual queries with LLMs.
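The two-step process described above can be sketched as a small prompting pipeline. Everything below is an illustrative reconstruction, not EMCEI's actual implementation: the prompts, the `generate` helper, and the candidate-selection logic are all assumptions.

```python
def generate(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM API of choice here")

def answer_with_cultural_context(query: str, language: str) -> str:
    # Step 1: elicit relevant cultural context from the model's own
    # parametric knowledge via prompting.
    context = generate(
        f"List cultural background knowledge relevant to answering this "
        f"{language} query:\n{query}"
    )
    # Produce candidate answers with and without the extracted context.
    candidates = [
        generate(f"Answer the query.\nQuery: {query}"),
        generate(f"Cultural context:\n{context}\n\nAnswer the query.\nQuery: {query}"),
    ]
    # Step 2: an LLM-as-Judge picks the response that best balances
    # cultural relevance and reasoning quality.
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    verdict = generate(
        f"Query: {query}\n\nCandidate answers:\n{numbered}\n\n"
        f"Reply with only the index of the most culturally appropriate "
        f"and best-reasoned answer."
    )
    return candidates[int(verdict.strip().strip("[]"))]
```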
Do Language Models Understand Honorific Systems in Javanese?
Farhansyah, Mohammad Rifqi, Darmawan, Iwan, Kusumawardhana, Adryan, Winata, Genta Indra, Aji, Alham Fikri, Wijaya, Derry Tanti
The Javanese language features a complex system of honorifics that vary according to the social status of the speaker, listener, and referent. Despite its cultural and linguistic significance, there has been limited progress in developing a comprehensive corpus to capture these variations for natural language processing (NLP) tasks. In this paper, we present Unggah-Ungguh, a carefully curated dataset designed to encapsulate the nuances of Unggah-Ungguh Basa, the Javanese speech etiquette framework that dictates the choice of words and phrases based on social hierarchy and context. Using Unggah-Ungguh, we assess the ability of language models (LMs) to process various levels of Javanese honorifics through classification and machine translation tasks. To further evaluate cross-lingual LMs, we conduct machine translation experiments between Javanese (at specific honorific levels) and Indonesian. Additionally, we explore whether LMs can generate contextually appropriate Javanese honorifics in conversation tasks, where the honorific usage should align with the social role and contextual cues. Our findings indicate that current LMs struggle with most honorific levels, exhibiting a bias toward certain honorific tiers.
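The honorific-level classification task mentioned above could be probed zero-shot along these lines. This is a minimal sketch with a simplified three-level label set (the standard ngoko/madya/krama speech levels); the prompt wording and the `generate` helper are illustrative assumptions, not the paper's protocol.

```python
LEVELS = ["ngoko", "madya", "krama"]  # low -> high politeness

def generate(prompt: str) -> str:
    raise NotImplementedError("wrap your LM API of choice here")

def classify_honorific(sentence: str) -> str:
    prompt = (
        "Which Javanese speech level does this sentence use? "
        f"Answer with one of {LEVELS}.\n\nSentence: {sentence}"
    )
    answer = generate(prompt).lower()
    # Fall back to the lowest level if the model's answer is unparseable.
    return next((lvl for lvl in LEVELS if lvl in answer), LEVELS[0])
```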
Cross-lingual Transfer Learning for Javanese Dependency Parsing
Ghiffari, Fadli Aulawi Al, Alfina, Ika, Azizah, Kurniawati
While structure learning achieves remarkable performance in high-resource languages, the situation differs for under-represented languages due to the scarcity of annotated data. This study focuses on assessing the efficacy of transfer learning in enhancing dependency parsing for Javanese, a language spoken by 80 million individuals but characterized by limited representation in natural language processing. We utilized the Universal Dependencies dataset consisting of dependency treebanks from more than 100 languages, including Javanese. We propose two learning strategies to train the model: transfer learning (TL) and hierarchical transfer learning (HTL). While TL only uses a source language to pre-train the model, the HTL method uses a source language and an intermediate language in the learning process. The results show that our best model uses the HTL method, which improves performance with an increase of 10% for both UAS and LAS evaluations compared to the baseline model.
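The UAS and LAS scores reported above have simple definitions: UAS counts tokens whose predicted head is correct, while LAS additionally requires the correct dependency label. A minimal sketch, with an illustrative per-token data structure:

```python
def uas_las(gold, pred):
    """gold, pred: lists of (head_index, dep_label) pairs, one per token."""
    assert len(gold) == len(pred)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))  # head AND label match
    n = len(gold)
    return uas_hits / n, las_hits / n

# Example: gold and predicted parses for a 3-token sentence.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]  # wrong label on token 3
print(uas_las(gold, pred))  # UAS = 1.00, LAS ~= 0.67
```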
XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese
Arisaputra, Panji, Handoyo, Alif Tri, Zahra, Amalia
ASR is a technological innovation that automatically converts spoken language into written text, with the goal of minimizing the Word Error Rate (WER) when transcribing oral input. ASR's core capability is to act as an optimal connector for information exchange between human-to-human and human-to-machine entities [1]. It has become increasingly important in various domains, including air traffic control, biometric security, games, closed captioning for YouTube, voice message transcription, and home automation. ASR's implementation in digital media resources is not a new phenomenon, but its complexity has increased [2]. This study is motivated by the rapid development of information and communication technology in Indonesia. In Figure 1, data from the Central Statistics Agency (Badan Pusat Statistik, BPS) [3] show that in 2021, 62.10% of Indonesians accessed the internet (82.07% at the household level) and mobile phone ownership reached 65.87%. Meanwhile, less mobile technologies such as computers and landline phones are being abandoned, with usage at only 18.24% and 1.36%, respectively. The conclusion is that Indonesians are shifting from traditional technology to more mobile and agile devices like smartphones, which require the right modalities for effective and efficient operation.
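Since WER is the central metric here, a minimal sketch of its computation may help: WER is the word-level Levenshtein distance (substitutions + deletions + insertions) divided by the reference length.

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("aku arep mangan", "aku arep dahar"))  # 1 substitution / 3 words
```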
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages
Winata, Genta Indra, Aji, Alham Fikri, Cahyawijaya, Samuel, Mahendra, Rahmad, Koto, Fajri, Romadhony, Ade, Kurniawan, Kemal, Moeljadi, David, Prasojo, Radityo Eko, Fung, Pascale, Baldwin, Timothy, Lau, Jey Han, Sennrich, Rico, Ruder, Sebastian
Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages in Indonesia. Despite being the second most linguistically diverse country, most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes datasets, a multi-task benchmark, and lexicons, as well as a parallel Indonesian-English dataset. We provide extensive analyses and describe the challenges when creating such resources. We hope that our work can spark NLP research on Indonesian and other underrepresented languages.
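For readers who want to try the resource, a minimal sketch of loading the sentiment portion follows. The Hugging Face dataset ID, config name, and column names below are assumptions; adjust them to the actual NusaX release.

```python
from datasets import load_dataset

# Javanese sentiment split (dataset ID assumed, not verified here).
ds = load_dataset("indonlp/NusaX-senti", "jav")

for example in ds["test"].select(range(3)):
    print(example["text"], "->", example["label"])
```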
Learning an artificial language for knowledge-sharing in multilingual translation
In their recent paper, Learning an artificial language for knowledge-sharing in multilingual translation, Danni Liu and Jan Niehues investigate multilingual neural machine translation models. Here, they tell us more about the main contributions of their research. Neural machine translation (NMT) is the backbone of many automatic translation platforms nowadays. Multilingual NMT models are appealing for two reasons: a single model can serve many translation directions, and it allows knowledge to be shared between languages. The second characteristic is especially useful in low-resource conditions, where training data (translated sentence pairs) are limited. To enable knowledge-sharing between languages, and to improve translation quality on low-resource translation directions, a precondition is the ability to capture common features between languages.
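One simple way to probe whether a multilingual encoder "captures common features between languages" is to compare pooled encoder states of parallel sentences. The sketch below uses a generic multilingual encoder as a stand-in; the model name and pooling choice are illustrative assumptions, not the method from the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "xlm-roberta-base"  # any multilingual encoder works for this probe
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(sentence: str) -> torch.Tensor:
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)  # mean-pool over tokens

a = embed("The weather is nice today.")
b = embed("Cuaca hari ini cerah.")  # Indonesian paraphrase
# Higher cosine similarity suggests more language-neutral representations.
print(torch.cosine_similarity(a, b, dim=0).item())
```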