Bridging Language Gaps with Adaptive RAG: Improving Indonesian Language Question Answering
Christian, William, Adamlu, Daniel, Yu, Adrian, Suhartono, Derwin
Abstract--Question Answering (QA) has seen significant improvements with the advancement of machine learning models. Further studies enhanced QA systems by retrieving external information, an approach called Retrieval-Augmented Generation (RAG), to produce more accurate and informative answers. However, this state-of-the-art performance is achieved predominantly in English. To address this gap, we bridge the language divide by applying an Adaptive RAG system to the Indonesian language. Adaptive RAG integrates a classifier that distinguishes question complexity, which in turn determines the strategy used to answer the question. To overcome the limited availability of Indonesian-language datasets, our study employs machine translation as a data augmentation approach. Experiments show a reliable question complexity classifier; however, we observed significant inconsistencies in the multi-retrieval answering strategy, which negatively impacted the overall evaluation when this strategy was applied.

Recent Large Language Models (LLMs) have shown impressive performance on many natural language tasks. However, despite their advancement across natural language processing, they still struggle with questions that require a knowledge-intensive background, often producing hallucinated answers [7]. LLMs tend to provide accurate answers when the entities mentioned in a question are present in their training data, and their performance correlates strongly with entity popularity; questions about less popular entities are often answered inaccurately [8]. Frequently updating an LLM's knowledge is not a practical solution, since training an LLM on billions or even trillions of tokens drawn from across the internet takes too much time.
In contrast, recent studies have demonstrated that augmenting question answering with non-parametric knowledge (information not contained in the model's training data), an approach commonly referred to as Retrieval-Augmented Generation (RAG) [9], enables even smaller models to outperform models with far more parameters [10].
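The routing idea behind Adaptive RAG can be sketched as follows. The labels and strategies are illustrative assumptions: the classifier here is a toy stand-in for the trained complexity classifier described in the abstract, and the label names "A"/"B"/"C" are hypothetical.

```python
# Illustrative sketch of the Adaptive RAG routing step: a question-complexity
# classifier picks one of three answering strategies. The classifier below is
# a toy placeholder, not the trained model from the study.
from typing import Callable

def route_question(question: str, classify: Callable[[str], str]) -> str:
    """Map a predicted complexity label to an answering strategy."""
    label = classify(question)  # hypothetical labels: "A" simple, "B" single-hop, "C" multi-hop
    if label == "A":
        return "no_retrieval"      # answer directly from parametric knowledge
    if label == "B":
        return "single_retrieval"  # one retrieve-then-generate pass
    return "multi_retrieval"       # iterative retrieval for multi-hop questions

def toy_classifier(q: str) -> str:
    # Toy heuristic: two entities joined by "dan" ("and") -> multi-hop;
    # a question word ("apa"/"siapa") -> single-hop; otherwise simple.
    if " dan " in q:
        return "C"
    return "B" if "siapa" in q.lower() or "apa" in q.lower() else "A"

print(route_question("Apa ibu kota Indonesia?", toy_classifier))  # single_retrieval
```

In the study's setting, the multi-retrieval branch is precisely where the reported inconsistencies appeared, so a router like this lets that strategy be isolated and evaluated separately.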
Leveraging IndoBERT and DistilBERT for Indonesian Emotion Classification in E-Commerce Reviews
Christian, William, Adamlu, Daniel, Yu, Adrian, Suhartono, Derwin
Understanding emotions in the Indonesian language is essential for improving customer experiences in e-commerce. This study focuses on enhancing the accuracy of emotion classification in Indonesian by leveraging advanced language models, IndoBERT and DistilBERT. A key component of our approach was data processing, specifically data augmentation, which included techniques such as back-translation and synonym replacement. These methods played a significant role in boosting the model's performance. After hyperparameter tuning, IndoBERT achieved an accuracy of 80%, demonstrating the impact of careful data processing. While combining multiple IndoBERT models led to a slight improvement, it did not significantly enhance performance. Our findings indicate that IndoBERT was the most effective model for emotion classification in Indonesian, with data augmentation proving to be a vital factor in achieving high accuracy. Future research should focus on exploring alternative architectures and strategies to improve generalization for Indonesian NLP tasks.
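The synonym-replacement augmentation mentioned above can be sketched in a few lines. The synonym table here is a small hand-made Indonesian example for illustration only, not the resource used in the study:

```python
import random

# Minimal sketch of synonym-replacement data augmentation: each word with a
# known synonym is swapped for a randomly chosen alternative, yielding a new
# training example with the same label. Toy synonym table, illustrative only.
SYNONYMS = {
    "bagus": ["baik", "hebat"],
    "senang": ["gembira", "bahagia"],
    "cepat": ["lekas", "segera"],
}

def augment(sentence: str, rng: random.Random) -> str:
    words = sentence.split()
    out = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]
    return " ".join(out)

rng = random.Random(0)
print(augment("pengiriman cepat dan barang bagus", rng))
```

Back-translation works analogously but routes the sentence through a pivot language (e.g. Indonesian to English and back) with a machine translation system, producing a paraphrase rather than word-level swaps.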
LoraxBench: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages
Aji, Alham Fikri, Cohn, Trevor
As one of the world's most populous countries, with 700 languages spoken, Indonesia is behind in terms of NLP progress. We introduce LoraxBench, a benchmark that focuses on low-resource languages of Indonesia and covers 6 diverse tasks: reading comprehension, open-domain QA, language inference, causal reasoning, translation, and cultural QA. Our dataset covers 20 languages, with the addition of two formality registers for three languages. We evaluate a diverse set of multilingual and region-focused LLMs and find that this benchmark is challenging. We note a visible discrepancy between performance in Indonesian and other languages, especially the low-resource ones. There is no clear lead when using a region-specific model as opposed to a general multilingual model. Lastly, we show that a change in register affects model performance, especially with registers not commonly found in social media, such as the high-politeness 'Krama' register of Javanese.
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts
Adilazuarda, Muhammad Farid, Wijanarko, Musa Izzanardi, Susanto, Lucky, Nur'aini, Khumaisa, Wijaya, Derry, Aji, Alham Fikri
Indonesia is rich in languages and scripts. However, most NLP progress has been made using romanized text. In this paper, we present NusaAksara, a novel public benchmark for Indonesian languages that includes their original scripts. Our benchmark covers both text and image modalities and encompasses diverse tasks such as image segmentation, OCR, transliteration, translation, and language identification. Our data is constructed by human experts through rigorous steps. NusaAksara covers 8 scripts across 7 languages, including low-resource languages not commonly seen in NLP benchmarks. Although unsupported by Unicode, the Lampung script is included in this dataset. We benchmark our data across several models, from LLMs and VLMs such as GPT-4o, Llama 3.2, and Aya 23 to task-specific systems such as PP-OCR and LangID, and show that most NLP technologies cannot handle Indonesia's local scripts, with many achieving near-zero performance.
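Among the task-specific baselines mentioned above, language identification is the simplest to illustrate. The following is a toy character-trigram identifier in the spirit of LangID-style systems; the two language profiles are tiny illustrative samples, not drawn from NusaAksara:

```python
from collections import Counter

# Toy character-trigram language identifier. Each language gets an n-gram
# count profile; a query is assigned to the language whose profile shares
# the most n-gram mass with it. Profiles here are illustrative one-liners.
def profile(text: str, n: int = 3) -> Counter:
    t = f"  {text.lower()}  "  # pad so word boundaries become n-grams
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def identify(text: str, profiles: dict) -> str:
    q = profile(text)
    def overlap(p: Counter) -> int:
        return sum(min(c, p[g]) for g, c in q.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))

profiles = {
    "ind": profile("saya pergi ke pasar untuk membeli beras dan sayur"),
    "jav": profile("aku lunga menyang pasar arep tuku beras lan sayuran"),
}
print(identify("dia membeli sayur di pasar", profiles))  # "ind"
```

Such character-level methods assume a known script; for Indonesia's indigenous scripts (let alone the Unicode-unsupported Lampung script), identification has to work on images instead, which is exactly the gap the benchmark probes.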
DriveThru: a Document Extraction Platform and Benchmark Datasets for Indonesian Local Language Archives
Farhansyah, Mohammad Rifqi, Johari, Muhammad Zuhdi Fikri, Amiral, Afinzaki, Purwarianti, Ayu, Yuana, Kumara Ari, Wijaya, Derry Tanti
Indonesia is one of the most linguistically diverse countries. Despite this diversity, however, Indonesian languages remain underrepresented in Natural Language Processing (NLP) research and technologies. In the past two years, several efforts have been made to construct NLP resources for Indonesian languages, but most have focused on manually created resources, which are difficult to scale to more languages. Although many Indonesian languages have no web presence, local resources document these languages well in printed forms such as books, magazines, and newspapers. Digitizing these existing resources will enable Indonesian language resource construction to scale to many more languages. In this paper, we propose an alternative method of creating datasets by digitizing documents that have not previously been used to build digital language resources in Indonesia. DriveThru is a platform for extracting document content using Optical Character Recognition (OCR) techniques, enabling language resource building with less manual effort and cost. This paper also studies the utility of current state-of-the-art LLMs for post-OCR correction, showing that they can increase the character accuracy rate (CAR) and word accuracy rate (WAR) compared to off-the-shelf OCR.
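The CAR metric mentioned above is commonly computed from character-level edit distance; a minimal sketch, assuming the usual 1 - errors/length formulation (definitions vary slightly across OCR papers):

```python
# Sketch of the character accuracy rate (CAR): 1 - edit_distance / len(reference).
# WAR is the same computation applied to word tokens instead of characters.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic rolling-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def car(reference: str, ocr_output: str) -> float:
    return 1 - edit_distance(reference, ocr_output) / len(reference)

print(round(car("bahasa daerah", "bahasa daerab"), 3))  # one wrong char -> 0.923
```

Post-OCR correction with an LLM is then evaluated by computing CAR/WAR before and after the correction pass against the same human-transcribed reference.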
Indo LEGO-ABSA: A Multitask Generative Aspect Based Sentiment Analysis for Indonesian Language
Suchrady, Randy Zakya, Purwarianti, Ayu
Aspect-based sentiment analysis is a method in natural language processing aimed at identifying and understanding sentiments related to specific aspects of an entity. Aspects are words or phrases that represent an aspect or attribute of a particular entity. Previous research has utilized generative pre-trained language models to perform aspect-based sentiment analysis. LEGO-ABSA is one framework that has successfully employed generative pre-trained language models in aspect-based sentiment analysis, particularly in English. LEGO-ABSA uses a multitask learning and prompting approach to enhance model performance. However, the application of this approach has not been done in the context of Bahasa Indonesia. Therefore, this research aims to implement the multitask learning and prompting approach in aspect-based sentiment analysis for Bahasa Indonesia using generative pre-trained language models. In this study, the Indo LEGO-ABSA model is developed, which is an aspect-based sentiment analysis model utilizing generative pre-trained language models and trained with multitask learning and prompting. Indo LEGO-ABSA is trained with a hotel domain dataset in the Indonesian language. The obtained results include an f1-score of 79.55% for the Aspect Sentiment Triplet Extraction task, 86.09% for Unified Aspect-based Sentiment Analysis, 79.85% for Aspect Opinion Pair Extraction, 87.45% for Aspect Term Extraction, and 88.09% for Opinion Term Extraction. Indo LEGO-ABSA adopts the LEGO-ABSA framework that employs the T5 model, specifically mT5, by applying multitask learning to train all tasks within aspect-based sentiment analysis.
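The multitask prompting setup described above can be sketched as task-prefixed text-to-text examples for a T5-style model. The prefixes and target formats below are hypothetical illustrations, not the exact templates used by Indo LEGO-ABSA:

```python
# Hypothetical sketch of prompt-based multitask ABSA training data: each task
# gets an Indonesian task prefix, and the generative model learns to emit a
# structured target string. Prefixes/formats are illustrative assumptions.
def make_example(task: str, sentence: str, target: str) -> dict:
    prefixes = {
        "aste": "ekstrak triplet aspek, opini, sentimen: ",  # Aspect Sentiment Triplet Extraction
        "ate":  "ekstrak istilah aspek: ",                    # Aspect Term Extraction
        "ote":  "ekstrak istilah opini: ",                    # Opinion Term Extraction
    }
    return {"input": prefixes[task] + sentence, "target": target}

ex = make_example(
    "aste",
    "kamarnya bersih dan pelayanannya ramah",
    "(kamar, bersih, positif); (pelayanan, ramah, positif)",
)
print(ex["input"])
```

Because every task shares one input/target interface, a single mT5 checkpoint can be fine-tuned on the union of all task datasets, which is the multitask-learning ingredient the framework relies on.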
Lexical Diversity in Kinship Across Languages and Dialects
Khalilia, Hadi, Bella, Gábor, Freihat, Abed Alhakim, Darma, Shandy, Giunchiglia, Fausto
Languages are known to describe the world in diverse ways. Across lexicons, diversity is pervasive, appearing through phenomena such as lexical gaps and untranslatability. However, in computational resources, such as multilingual lexical databases, diversity is hardly ever represented. In this paper, we introduce a method to enrich computational lexicons with content relating to linguistic diversity. The method is verified through two large-scale case studies on kinship terminology, a domain known to be diverse across languages and cultures: one case study deals with seven Arabic dialects, while the other one with three Indonesian languages. Our results, made available as browseable and downloadable computational resources, extend prior linguistics research on kinship terminology, and provide insight into the extent of diversity even within linguistically and culturally close communities.
Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU
Koto, Fajri, Aisyah, Nurul, Li, Haonan, Baldwin, Timothy
Although large language models (LLMs) are often pre-trained on large-scale multilingual texts, their reasoning abilities and real-world knowledge are mainly evaluated based on English datasets. Assessing LLM capabilities beyond English is increasingly vital but hindered due to the lack of suitable datasets. In this work, we introduce IndoMMLU, the first multi-task language understanding benchmark for Indonesian culture and languages, which consists of questions from primary school to university entrance exams in Indonesia. By employing professional teachers, we obtain 14,981 questions across 64 tasks and education levels, with 46% of the questions focusing on assessing proficiency in the Indonesian language and knowledge of nine local languages and cultures in Indonesia. Our empirical evaluations show that GPT-3.5 only manages to pass the Indonesian primary school level, with limited knowledge of local Indonesian languages and culture. Other smaller models such as BLOOMZ and Falcon perform at even lower levels.
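The per-level "pass" criterion implied above can be sketched as average accuracy per education level compared against a passing threshold. The threshold of 65 here is a hypothetical cutoff for illustration; the exact passing grade used by IndoMMLU is not stated in this abstract:

```python
# Sketch of a per-education-level pass criterion: group question results by
# level, average them, and compare to a passing threshold (65 is hypothetical).
def pass_rates(results, threshold=65.0):
    by_level = {}
    for level, correct in results:          # correct is 1 or 0 per question
        by_level.setdefault(level, []).append(correct)
    return {lvl: (100 * sum(v) / len(v)) >= threshold for lvl, v in by_level.items()}

# SD = primary school, SMA = senior high school (standard Indonesian level names).
results = [("SD", 1), ("SD", 1), ("SD", 0), ("SMA", 0), ("SMA", 1), ("SMA", 0)]
print(pass_rates(results))  # {'SD': True, 'SMA': False}
```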
Domain-Specific Language Model Post-Training for Indonesian Financial NLP
Maharani, Ni Putu Intan, Yustiawan, Yoga, Rochim, Fauzy Caesar, Purwarianti, Ayu
Recently, self-supervised pre-training of contextual language models on large general-domain corpora, such as ELMo [7], ULM-Fit [8], XLNet [9], GPT [10], BERT [2], and IndoBERT [1], has significantly improved performance on various natural language processing downstream tasks, including sentence classification, token classification, and question answering. One of the notable examples is Bidirectional Encoder Representations from Transformers (BERT), which has become a standard benchmark for training NLP models for various downstream tasks. Another example is IndoBERT, an implementation of BERT specific to the Indonesian language, which also performs well as a building block for training task-specific NLP models for Indonesian [1]. IndoBERT, the foundation of this research, has a similar model architecture to BERT. However, those pre-training works focus on the general domain, in which the unlabeled text data are collected from the Web, newswire, Wikipedia, and BookCorpus [1], [2].
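Domain-specific post-training continues the BERT-style masked-language-modeling (MLM) objective on an in-domain (here, financial) corpus. A minimal sketch of the generic MLM masking step, not the exact IndoBERT configuration:

```python
import random

# Sketch of MLM input preparation for continued pre-training: roughly 15% of
# tokens are replaced by [MASK], and the model learns to recover the originals.
# (The full BERT recipe also sometimes keeps or randomizes masked tokens.)
def mask_tokens(tokens, rng, p=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            inputs.append("[MASK]")
            labels.append(tok)   # supervised target at this position
        else:
            inputs.append(tok)
            labels.append("-")   # position ignored by the loss
    return inputs, labels

rng = random.Random(1)
inp, lab = mask_tokens("laba bersih bank naik pada kuartal ketiga".split(), rng)
print(inp)
```

Running this over a financial-domain corpus before task fine-tuning is what adapts the general-domain checkpoint to financial vocabulary and usage.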
MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian
Multimodal learning on video and text data has been receiving growing attention from many researchers in various research tasks, including text-to-video retrieval, video-to-text retrieval, and video captioning. Although many algorithms have been proposed for those challenging tasks, most of them are developed on English language datasets. Despite Indonesian being one of the most spoken languages in the world, the research progress on the multimodal video-text with Indonesian sentences is still under-explored, likely due to the absence of the public benchmark dataset. To address this issue, we construct the first public Indonesian video-text dataset by translating English sentences from the MSVD dataset to Indonesian sentences. Using our dataset, we then train neural network models which were developed for the English video-text dataset on three tasks, i.e., text-to-video retrieval, video-to-text retrieval, and video captioning. The recent neural network-based approaches to video-text tasks often utilized a feature extractor that is primarily pretrained on an English vision-language dataset. Since the availability of the pretraining resources with Indonesian sentences is relatively limited, the applicability of those approaches to our dataset is still questionable. To overcome the lack of pretraining resources, we apply cross-lingual transfer learning by utilizing the feature extractors pretrained on the English dataset, and we then fine-tune the models on our Indonesian dataset. Our experimental results show that this approach can help to improve the performance for the three tasks on all metrics. Finally, we discuss potential future works using our dataset, inspiring further research in the Indonesian multimodal video-text tasks. We believe that our dataset and our experimental results could provide valuable contributions to the community. Our dataset is available on GitHub.
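The retrieval tasks above reduce to ranking by similarity in a shared embedding space: a text encoder and a video encoder map both modalities to vectors, and candidates are sorted by cosine similarity. A minimal sketch with toy vectors standing in for encoder outputs:

```python
import math

# Sketch of text-to-video retrieval: rank candidate videos by cosine
# similarity between the query (sentence) embedding and video embeddings.
# The 3-d vectors below are toy stand-ins for real encoder outputs.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_videos(query_vec, video_vecs):
    return sorted(video_vecs, key=lambda vid: cosine(query_vec, video_vecs[vid]),
                  reverse=True)

videos = {"vid1": [0.9, 0.1, 0.0], "vid2": [0.1, 0.8, 0.2], "vid3": [0.4, 0.4, 0.4]}
print(rank_videos([1.0, 0.1, 0.0], videos))  # 'vid1' ranked first
```

Cross-lingual transfer fits in at the encoder stage: the visual feature extractor pretrained on English vision-language data is kept, while the text side is fine-tuned on the Indonesian sentences, so only the query embeddings change.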