Bridging Language Gaps with Adaptive RAG: Improving Indonesian Language Question Answering
Christian, William, Adamlu, Daniel, Yu, Adrian, Suhartono, Derwin
Abstract--Question Answering (QA) has seen significant improvements with the advancement of machine learning models. Further studies enhanced QA systems by retrieving external information, an approach called Retrieval-Augmented Generation (RAG), to produce more accurate and informative answers. However, this state-of-the-art performance is achieved predominantly in English. To address this gap, we bridge the language divide by applying an Adaptive RAG system to the Indonesian language. Adaptive RAG integrates a classifier that distinguishes question complexity, which in turn determines the strategy used to answer the question. To overcome the limited availability of Indonesian-language datasets, our study employs machine translation as a data augmentation approach. Experiments show a reliable question complexity classifier; however, we observed significant inconsistencies in the multi-retrieval answering strategy, which negatively impacted the overall evaluation when this strategy was applied.

Recent Large Language Models (LLMs) have shown impressive performance on many natural language tasks. However, despite their advancement across natural language processing, they still struggle with questions that require a knowledge-intensive background, often producing hallucinated answers [7]. LLMs tend to provide accurate answers when the entities mentioned in a question are present in their training data, and their performance correlates strongly with entity popularity; questions about less popular entities are often answered inaccurately [8]. Frequently updating an LLM's knowledge is not a practical solution, since training an LLM on billions or even trillions of tokens drawn from across the internet takes too much time.
In contrast, recent studies have demonstrated that augmenting question answering with non-parametric knowledge (information not contained in the model's training data), an approach commonly referred to as Retrieval-Augmented Generation (RAG) [9], enables even smaller models to outperform models with far more parameters [10].
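The routing idea behind Adaptive RAG can be sketched as follows. The labels and strategies are illustrative assumptions: the classifier here is a toy stand-in for the trained complexity classifier described in the abstract, and the label names "A"/"B"/"C" are hypothetical.

```python
# Illustrative sketch of the Adaptive RAG routing step: a question-complexity
# classifier picks one of three answering strategies. The classifier below is
# a toy placeholder, not the trained model from the study.
from typing import Callable

def route_question(question: str, classify: Callable[[str], str]) -> str:
    """Map a predicted complexity label to an answering strategy."""
    label = classify(question)  # hypothetical labels: "A" simple, "B" single-hop, "C" multi-hop
    if label == "A":
        return "no_retrieval"      # answer directly from parametric knowledge
    if label == "B":
        return "single_retrieval"  # one retrieve-then-generate pass
    return "multi_retrieval"       # iterative retrieval for multi-hop questions

def toy_classifier(q: str) -> str:
    # Toy heuristic: two entities joined by "dan" ("and") -> multi-hop;
    # a question word ("apa"/"siapa") -> single-hop; otherwise simple.
    if " dan " in q:
        return "C"
    return "B" if "siapa" in q.lower() or "apa" in q.lower() else "A"

print(route_question("Apa ibu kota Indonesia?", toy_classifier))  # single_retrieval
```

In the study's setting, the multi-retrieval branch is precisely where the reported inconsistencies appeared, so a router like this lets that strategy be isolated and evaluated separately.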
Leveraging IndoBERT and DistilBERT for Indonesian Emotion Classification in E-Commerce Reviews
Christian, William, Adamlu, Daniel, Yu, Adrian, Suhartono, Derwin
Understanding emotions in the Indonesian language is essential for improving customer experiences in e-commerce. This study focuses on enhancing the accuracy of emotion classification in Indonesian by leveraging advanced language models, IndoBERT and DistilBERT. A key component of our approach was data processing, specifically data augmentation, which included techniques such as back-translation and synonym replacement. These methods played a significant role in boosting the model's performance. After hyperparameter tuning, IndoBERT achieved an accuracy of 80%, demonstrating the impact of careful data processing. While combining multiple IndoBERT models led to a slight improvement, it did not significantly enhance performance. Our findings indicate that IndoBERT was the most effective model for emotion classification in Indonesian, with data augmentation proving to be a vital factor in achieving high accuracy. Future research should focus on exploring alternative architectures and strategies to improve generalization for Indonesian NLP tasks.
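The synonym-replacement augmentation mentioned above can be sketched in a few lines. The synonym table here is a small hand-made Indonesian example for illustration only, not the resource used in the study:

```python
import random

# Minimal sketch of synonym-replacement data augmentation: each word with a
# known synonym is swapped for a randomly chosen alternative, yielding a new
# training example with the same label. Toy synonym table, illustrative only.
SYNONYMS = {
    "bagus": ["baik", "hebat"],
    "senang": ["gembira", "bahagia"],
    "cepat": ["lekas", "segera"],
}

def augment(sentence: str, rng: random.Random) -> str:
    words = sentence.split()
    out = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]
    return " ".join(out)

rng = random.Random(0)
print(augment("pengiriman cepat dan barang bagus", rng))
```

Back-translation works analogously but routes the sentence through a pivot language (e.g. Indonesian to English and back) with a machine translation system, producing a paraphrase rather than word-level swaps.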
LoraxBench: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages
Aji, Alham Fikri, Cohn, Trevor
As one of the world's most populous countries, with 700 languages spoken, Indonesia is behind in terms of NLP progress. We introduce LoraxBench, a benchmark that focuses on low-resource languages of Indonesia and covers 6 diverse tasks: reading comprehension, open-domain QA, language inference, causal reasoning, translation, and cultural QA. Our dataset covers 20 languages, with the addition of two formality registers for three languages. We evaluate a diverse set of multilingual and region-focused LLMs and find that this benchmark is challenging. We note a visible discrepancy between performance in Indonesian and other languages, especially the low-resource ones. There is no clear lead when using a region-specific model as opposed to a general multilingual model. Lastly, we show that a change in register affects model performance, especially with registers not commonly found in social media, such as the high-politeness 'Krama' register of Javanese.
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts
Adilazuarda, Muhammad Farid, Wijanarko, Musa Izzanardi, Susanto, Lucky, Nur'aini, Khumaisa, Wijaya, Derry, Aji, Alham Fikri
Indonesia is rich in languages and scripts. However, most NLP progress has been made using romanized text. In this paper, we present NusaAksara, a novel public benchmark for Indonesian languages that includes their original scripts. Our benchmark covers both text and image modalities and encompasses diverse tasks such as image segmentation, OCR, transliteration, translation, and language identification. Our data is constructed by human experts through rigorous steps. NusaAksara covers 8 scripts across 7 languages, including low-resource languages not commonly seen in NLP benchmarks. Although unsupported by Unicode, the Lampung script is included in this dataset. We benchmark our data across several models, from LLMs and VLMs such as GPT-4o, Llama 3.2, and Aya 23 to task-specific systems such as PP-OCR and LangID, and show that most NLP technologies cannot handle Indonesia's local scripts, with many achieving near-zero performance.
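Among the task-specific baselines mentioned above, language identification is the simplest to illustrate. The following is a toy character-trigram identifier in the spirit of LangID-style systems; the two language profiles are tiny illustrative samples, not drawn from NusaAksara:

```python
from collections import Counter

# Toy character-trigram language identifier. Each language gets an n-gram
# count profile; a query is assigned to the language whose profile shares
# the most n-gram mass with it. Profiles here are illustrative one-liners.
def profile(text: str, n: int = 3) -> Counter:
    t = f"  {text.lower()}  "  # pad so word boundaries become n-grams
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def identify(text: str, profiles: dict) -> str:
    q = profile(text)
    def overlap(p: Counter) -> int:
        return sum(min(c, p[g]) for g, c in q.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))

profiles = {
    "ind": profile("saya pergi ke pasar untuk membeli beras dan sayur"),
    "jav": profile("aku lunga menyang pasar arep tuku beras lan sayuran"),
}
print(identify("dia membeli sayur di pasar", profiles))  # "ind"
```

Such character-level methods assume a known script; for Indonesia's indigenous scripts (let alone the Unicode-unsupported Lampung script), identification has to work on images instead, which is exactly the gap the benchmark probes.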
DriveThru: a Document Extraction Platform and Benchmark Datasets for Indonesian Local Language Archives
Farhansyah, Mohammad Rifqi, Johari, Muhammad Zuhdi Fikri, Amiral, Afinzaki, Purwarianti, Ayu, Yuana, Kumara Ari, Wijaya, Derry Tanti
Indonesia is one of the most linguistically diverse countries. Despite this diversity, however, Indonesian languages remain underrepresented in Natural Language Processing (NLP) research and technologies. In the past two years, several efforts have been made to construct NLP resources for Indonesian languages, but most have focused on manually created resources, which are difficult to scale to more languages. Although many Indonesian languages have no web presence, local resources document these languages well in printed forms such as books, magazines, and newspapers. Digitizing these existing resources will enable Indonesian language resource construction to scale to many more languages. In this paper, we propose an alternative method of creating datasets by digitizing documents that have not previously been used to build digital language resources in Indonesia. DriveThru is a platform for extracting document content using Optical Character Recognition (OCR) techniques, enabling language resource building with less manual effort and cost. This paper also studies the utility of current state-of-the-art LLMs for post-OCR correction, showing that they can increase the character accuracy rate (CAR) and word accuracy rate (WAR) compared to off-the-shelf OCR.
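The CAR metric mentioned above is commonly computed from character-level edit distance; a minimal sketch, assuming the usual 1 - errors/length formulation (definitions vary slightly across OCR papers):

```python
# Sketch of the character accuracy rate (CAR): 1 - edit_distance / len(reference).
# WAR is the same computation applied to word tokens instead of characters.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic rolling-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def car(reference: str, ocr_output: str) -> float:
    return 1 - edit_distance(reference, ocr_output) / len(reference)

print(round(car("bahasa daerah", "bahasa daerab"), 3))  # one wrong char -> 0.923
```

Post-OCR correction with an LLM is then evaluated by computing CAR/WAR before and after the correction pass against the same human-transcribed reference.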
Indo LEGO-ABSA: A Multitask Generative Aspect Based Sentiment Analysis for Indonesian Language
Suchrady, Randy Zakya, Purwarianti, Ayu
Aspect-based sentiment analysis is a method in natural language processing aimed at identifying and understanding sentiments related to specific aspects of an entity. Aspects are words or phrases that represent an aspect or attribute of a particular entity. Previous research has utilized generative pre-trained language models to perform aspect-based sentiment analysis. LEGO-ABSA is one framework that has successfully employed generative pre-trained language models in aspect-based sentiment analysis, particularly in English. LEGO-ABSA uses a multitask learning and prompting approach to enhance model performance. However, the application of this approach has not been done in the context of Bahasa Indonesia. Therefore, this research aims to implement the multitask learning and prompting approach in aspect-based sentiment analysis for Bahasa Indonesia using generative pre-trained language models. In this study, the Indo LEGO-ABSA model is developed, which is an aspect-based sentiment analysis model utilizing generative pre-trained language models and trained with multitask learning and prompting. Indo LEGO-ABSA is trained with a hotel domain dataset in the Indonesian language. The obtained results include an f1-score of 79.55% for the Aspect Sentiment Triplet Extraction task, 86.09% for Unified Aspect-based Sentiment Analysis, 79.85% for Aspect Opinion Pair Extraction, 87.45% for Aspect Term Extraction, and 88.09% for Opinion Term Extraction. Indo LEGO-ABSA adopts the LEGO-ABSA framework that employs the T5 model, specifically mT5, by applying multitask learning to train all tasks within aspect-based sentiment analysis.
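The multitask prompting setup described above can be sketched as task-prefixed text-to-text examples for a T5-style model. The prefixes and target formats below are hypothetical illustrations, not the exact templates used by Indo LEGO-ABSA:

```python
# Hypothetical sketch of prompt-based multitask ABSA training data: each task
# gets an Indonesian task prefix, and the generative model learns to emit a
# structured target string. Prefixes/formats are illustrative assumptions.
def make_example(task: str, sentence: str, target: str) -> dict:
    prefixes = {
        "aste": "ekstrak triplet aspek, opini, sentimen: ",  # Aspect Sentiment Triplet Extraction
        "ate":  "ekstrak istilah aspek: ",                    # Aspect Term Extraction
        "ote":  "ekstrak istilah opini: ",                    # Opinion Term Extraction
    }
    return {"input": prefixes[task] + sentence, "target": target}

ex = make_example(
    "aste",
    "kamarnya bersih dan pelayanannya ramah",
    "(kamar, bersih, positif); (pelayanan, ramah, positif)",
)
print(ex["input"])
```

Because every task shares one input/target interface, a single mT5 checkpoint can be fine-tuned on the union of all task datasets, which is the multitask-learning ingredient the framework relies on.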
Lexical Diversity in Kinship Across Languages and Dialects
Khalilia, Hadi, Bella, Gábor, Freihat, Abed Alhakim, Darma, Shandy, Giunchiglia, Fausto
Languages are known to describe the world in diverse ways. Across lexicons, diversity is pervasive, appearing through phenomena such as lexical gaps and untranslatability. However, in computational resources, such as multilingual lexical databases, diversity is hardly ever represented. In this paper, we introduce a method to enrich computational lexicons with content relating to linguistic diversity. The method is verified through two large-scale case studies on kinship terminology, a domain known to be diverse across languages and cultures: one case study deals with seven Arabic dialects, while the other one with three Indonesian languages. Our results, made available as browseable and downloadable computational resources, extend prior linguistics research on kinship terminology, and provide insight into the extent of diversity even within linguistically and culturally close communities.
Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU
Koto, Fajri, Aisyah, Nurul, Li, Haonan, Baldwin, Timothy
Although large language models (LLMs) are often pre-trained on large-scale multilingual texts, their reasoning abilities and real-world knowledge are mainly evaluated based on English datasets. Assessing LLM capabilities beyond English is increasingly vital but hindered due to the lack of suitable datasets. In this work, we introduce IndoMMLU, the first multi-task language understanding benchmark for Indonesian culture and languages, which consists of questions from primary school to university entrance exams in Indonesia. By employing professional teachers, we obtain 14,981 questions across 64 tasks and education levels, with 46% of the questions focusing on assessing proficiency in the Indonesian language and knowledge of nine local languages and cultures in Indonesia. Our empirical evaluations show that GPT-3.5 only manages to pass the Indonesian primary school level, with limited knowledge of local Indonesian languages and culture. Other smaller models such as BLOOMZ and Falcon perform at even lower levels.
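The per-level "pass" criterion implied above can be sketched as average accuracy per education level compared against a passing threshold. The threshold of 65 here is a hypothetical cutoff for illustration; the exact passing grade used by IndoMMLU is not stated in this abstract:

```python
# Sketch of a per-education-level pass criterion: group question results by
# level, average them, and compare to a passing threshold (65 is hypothetical).
def pass_rates(results, threshold=65.0):
    by_level = {}
    for level, correct in results:          # correct is 1 or 0 per question
        by_level.setdefault(level, []).append(correct)
    return {lvl: (100 * sum(v) / len(v)) >= threshold for lvl, v in by_level.items()}

# SD = primary school, SMA = senior high school (standard Indonesian level names).
results = [("SD", 1), ("SD", 1), ("SD", 0), ("SMA", 0), ("SMA", 1), ("SMA", 0)]
print(pass_rates(results))  # {'SD': True, 'SMA': False}
```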
Domain-Specific Language Model Post-Training for Indonesian Financial NLP
Maharani, Ni Putu Intan, Yustiawan, Yoga, Rochim, Fauzy Caesar, Purwarianti, Ayu
Recently, self-supervised pre-training of contextual language models on large general-domain corpora, such as ELMo [7], ULM-Fit [8], XLNet [9], GPT [10], BERT [2], and IndoBERT [1], has significantly improved performance on various natural language processing downstream tasks, including sentence classification, token classification, and question answering. One of the notable examples is Bidirectional Encoder Representations from Transformers (BERT), which has become a standard benchmark for training NLP models for various downstream tasks. Another example is IndoBERT, an implementation of BERT specific to the Indonesian language, which also performs well as a building block for training task-specific NLP models for Indonesian [1]. IndoBERT, the foundation of this research, has a similar model architecture to BERT. However, those pre-training works focus on the general domain, in which the unlabeled text data are collected from the Web, newswire, Wikipedia, and BookCorpus [1], [2].
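Domain-specific post-training continues the BERT-style masked-language-modeling (MLM) objective on an in-domain (here, financial) corpus. A minimal sketch of the generic MLM masking step, not the exact IndoBERT configuration:

```python
import random

# Sketch of MLM input preparation for continued pre-training: roughly 15% of
# tokens are replaced by [MASK], and the model learns to recover the originals.
# (The full BERT recipe also sometimes keeps or randomizes masked tokens.)
def mask_tokens(tokens, rng, p=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            inputs.append("[MASK]")
            labels.append(tok)   # supervised target at this position
        else:
            inputs.append(tok)
            labels.append("-")   # position ignored by the loss
    return inputs, labels

rng = random.Random(1)
inp, lab = mask_tokens("laba bersih bank naik pada kuartal ketiga".split(), rng)
print(inp)
```

Running this over a financial-domain corpus before task fine-tuning is what adapts the general-domain checkpoint to financial vocabulary and usage.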
MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian
Multimodal learning on video and text data has been receiving growing attention from many researchers in various research tasks, including text-to-video retrieval, video-to-text retrieval, and video captioning. Although many algorithms have been proposed for those challenging tasks, most of them are developed on English language datasets. Despite Indonesian being one of the most spoken languages in the world, the research progress on the multimodal video-text with Indonesian sentences is still under-explored, likely due to the absence of the public benchmark dataset. To address this issue, we construct the first public Indonesian video-text dataset by translating English sentences from the MSVD dataset to Indonesian sentences. Using our dataset, we then train neural network models which were developed for the English video-text dataset on three tasks, i.e., text-to-video retrieval, video-to-text retrieval, and video captioning. The recent neural network-based approaches to video-text tasks often utilized a feature extractor that is primarily pretrained on an English vision-language dataset. Since the availability of the pretraining resources with Indonesian sentences is relatively limited, the applicability of those approaches to our dataset is still questionable. To overcome the lack of pretraining resources, we apply cross-lingual transfer learning by utilizing the feature extractors pretrained on the English dataset, and we then fine-tune the models on our Indonesian dataset. Our experimental results show that this approach can help to improve the performance for the three tasks on all metrics. Finally, we discuss potential future works using our dataset, inspiring further research in the Indonesian multimodal video-text tasks. We believe that our dataset and our experimental results could provide valuable contributions to the community. Our dataset is available on GitHub.
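The retrieval tasks above reduce to ranking by similarity in a shared embedding space: a text encoder and a video encoder map both modalities to vectors, and candidates are sorted by cosine similarity. A minimal sketch with toy vectors standing in for encoder outputs:

```python
import math

# Sketch of text-to-video retrieval: rank candidate videos by cosine
# similarity between the query (sentence) embedding and video embeddings.
# The 3-d vectors below are toy stand-ins for real encoder outputs.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_videos(query_vec, video_vecs):
    return sorted(video_vecs, key=lambda vid: cosine(query_vec, video_vecs[vid]),
                  reverse=True)

videos = {"vid1": [0.9, 0.1, 0.0], "vid2": [0.1, 0.8, 0.2], "vid3": [0.4, 0.4, 0.4]}
print(rank_videos([1.0, 0.1, 0.0], videos))  # 'vid1' ranked first
```

Cross-lingual transfer fits in at the encoder stage: the visual feature extractor pretrained on English vision-language data is kept, while the text side is fine-tuned on the Indonesian sentences, so only the query embeddings change.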