AITopics | genta indra winata

Collaborating Authors

genta indra winata

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian

Wiyono, Vanessa Rebecca, Anugraha, David, Purwarianti, Ayu, Winata, Genta Indra

arXiv.org Artificial IntelligenceNov-13-2025

Over 200 million people speak Indonesian, yet the language remains significantly underrepresented in preference-based research for large language models (LLMs). Most existing multilingual datasets are derived from English translations, often resulting in content that lacks cultural and linguistic authenticity. To address this gap, we introduce IndoPref, the first fully human-authored and multi-domain Indonesian preference dataset designed to evaluate the naturalness and quality of LLM-generated text. The dataset contains 522 prompts and yields 4,099 human-annotated pairwise preferences from comparisons across five instruction-tuned LLMs. All annotations are natively written in Indonesian with strong inter-annotator agreement, measured by Krippendorff's alpha. Our benchmark spans 10 diverse categories, enabling practitioners to identify LLMs' fine-grained strengths and weaknesses.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2507.22159

Country: Asia (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Add feedback

LLMs Are Few-Shot In-Context Low-Resource Language Learners

Cahyawijaya, Samuel, Lovenia, Holy, Fung, Pascale

arXiv.org Artificial IntelligenceJun-25-2024

In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages using only short in-context information, offering a crucial avenue for narrowing the gap between high-resource and low-resource languages. Nonetheless, there is only a handful of works explored ICL for low-resource languages with most of them focusing on relatively high-resource languages, such as French and Spanish. In this work, we extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages. Our study not only assesses the effectiveness of ICL with LLMs in low-resource languages but also identifies the shortcomings of in-context label alignment, and introduces a more effective alternative: query alignment. Moreover, we provide valuable insights into various facets of ICL for low-resource languages. Our study concludes the significance of few-shot in-context information on enhancing the low-resource understanding quality of LLMs through semantically relevant information by closing the language gap in the target language and aligning the semantics between the targeted low-resource and the high-resource language that the model is proficient in. Our work highlights the importance of advancing ICL research, particularly for low-resource languages. Our code is publicly released at https://github.com/SamuelCahyawijaya/in-context-alignment

alignment, computational linguistic, low-resource language, (14 more...)

arXiv.org Artificial Intelligence

2403.16512

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
South America (0.05)
Asia > East Asia (0.04)
(22 more...)

Genre: Research Report > New Finding (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages

Adilazuarda, Muhammad Farid, Cahyawijaya, Samuel, Winata, Genta Indra, Fung, Pascale, Purwarianti, Ayu

arXiv.org Artificial IntelligenceNov-21-2023

In addition, we explore Processing (NLP) have introduced an immense methods to improve the robustness of LMs to improvement in many aspects, including code-mixed text. Using our IndoRobusta-Shot, standardized benchmarks (Wilie et al., 2020; we perform adversarial training to improve the Cahyawijaya et al., 2021; Koto et al., 2020; Winata code-mixed robustness of LMs. We explore three et al., 2022), large pre-trained language model kinds of tuning strategies: 1) code-mix only, 2) (LM) (Wilie et al., 2020; Cahyawijaya et al., 2021; two-steps, and 3) joint training, and empirically Koto et al., 2020), and resource expansion covering search for the best strategy to improve the model local Indonesian languages (Tri Apriani, 2016; robustness on code-mixed data.

computational linguistic, proceedings, robustness, (12 more...)

arXiv.org Artificial Intelligence

2311.12405

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Hong Kong (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(5 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Communications > Social Media (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

InstructAlign: High-and-Low Resource Language Alignment via Continual Crosslingual Instruction Tuning

Cahyawijaya, Samuel, Lovenia, Holy, Yu, Tiezheng, Chung, Willy, Fung, Pascale

arXiv.org Artificial IntelligenceOct-24-2023

Large language models (LLMs) that are tuned with instructions have demonstrated remarkable capabilities in various tasks and languages. However, their ability to generalize to underrepresented languages is limited due to the scarcity of available data. Additionally, directly adapting new languages to instruction-tuned LLMs can result in catastrophic forgetting, which leads to the loss of multitasking ability. To address this issue, we propose InstructAlign which uses continual crosslingual instruction tuning to enable LLMs to align new unseen languages with previously learned high-resource languages. Our results demonstrate the effectiveness of InstructAlign in enabling the model to understand low-resource languages with limited parallel data while preventing catastrophic forgetting. Our work contributes to the advancement of language adaptation methods, particularly for adapting instruction-tuned LLMs to underrepresented languages. Our code is released on https://github.com/HLTCHKUST/InstructAlign

computational linguistic, instructalign, instruction, (14 more...)

arXiv.org Artificial Intelligence

2305.13627

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
(15 more...)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Multilingual Large Language Models Are Not (Yet) Code-Switchers

Zhang, Ruochen, Cahyawijaya, Samuel, Cruz, Jan Christian Blaise, Winata, Genta Indra, Aji, Alham Fikri

arXiv.org Artificial IntelligenceOct-23-2023

Multilingual Large Language Models (LLMs) have recently shown great capabilities in a wide range of tasks, exhibiting state-of-the-art performance through zero-shot or few-shot prompting methods. While there have been extensive studies on their abilities in monolingual tasks, the investigation of their potential in the context of code-switching (CSW), the practice of alternating languages within an utterance, remains relatively uncharted. In this paper, we provide a comprehensive empirical analysis of various multilingual LLMs, benchmarking their performance across four tasks: sentiment analysis, machine translation, summarization and word-level language identification. Our results indicate that despite multilingual LLMs exhibiting promising outcomes in certain tasks using zero or few-shot prompting, they still underperform in comparison to fine-tuned models of much smaller scales. We argue that current "multilingualism" in LLMs does not inherently imply proficiency with code-switching texts, calling for future research to bridge this discrepancy.

computational linguistic, multilingual llm, proceedings, (9 more...)

arXiv.org Artificial Intelligence

2305.14235

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Dominican Republic (0.04)
North America > United States > Maryland > Baltimore (0.04)
(6 more...)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

Yong, Zheng-Xin, Zhang, Ruochen, Forde, Jessica Zosa, Wang, Skyler, Subramonian, Arjun, Lovenia, Holy, Cahyawijaya, Samuel, Winata, Genta Indra, Sutawika, Lintang, Cruz, Jan Christian Blaise, Tan, Yin Lin, Phan, Long, Garcia, Rowena, Solorio, Thamar, Aji, Alham Fikri

arXiv.org Artificial IntelligenceSep-12-2023

While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing. For instance, ChatGPT generates fluent and natural Singlish texts (an English-based creole spoken in Singapore), but for English-Tamil language pair, the system mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.

chatgpt, code-mixed data, computational linguistic, (16 more...)

arXiv.org Artificial Intelligence

2303.13592

Country:

Asia > East Asia (0.24)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
Asia > Philippines (0.04)
(23 more...)

Genre:

Research Report (1.00)
Personal > Interview (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.91)

Add feedback

NusaCrowd: Open Source Initiative for Indonesian NLP Resources

Cahyawijaya, Samuel, Lovenia, Holy, Aji, Alham Fikri, Winata, Genta Indra, Wilie, Bryan, Mahendra, Rahmad, Wibisono, Christian, Romadhony, Ade, Vincentio, Karissa, Koto, Fajri, Santoso, Jennifer, Moeljadi, David, Wirawan, Cahya, Hudi, Frederikus, Parmonangan, Ivan Halim, Alfina, Ika, Wicaksono, Muhammad Satrio, Putra, Ilham Firdausi, Rahmadani, Samsul, Oenang, Yulianti, Septiandri, Ali Akbar, Jaya, James, Dhole, Kaustubh D., Suryani, Arie Ardiyanti, Putri, Rifki Afina, Su, Dan, Stevens, Keith, Nityasya, Made Nindyatama, Adilazuarda, Muhammad Farid, Ignatius, Ryan, Diandaru, Ryandito, Yu, Tiezheng, Ghifari, Vito, Dai, Wenliang, Xu, Yan, Damapuspita, Dyah, Tho, Cuk, Karo, Ichwanul Muslim Karo, Fatyanosa, Tirana Noor, Ji, Ziwei, Fung, Pascale, Neubig, Graham, Baldwin, Timothy, Ruder, Sebastian, Sujaini, Herry, Sakti, Sakriani, Purwarianti, Ayu

arXiv.org Artificial IntelligenceJul-21-2023

We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.

large language model, machine learning, natural language, (24 more...)

arXiv.org Artificial Intelligence

2212.09648

Country:

North America > United States > Texas > Dallas County > Dallas (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Timor-Leste (0.14)
(64 more...)

Genre: Research Report > New Finding (0.45)

Industry:

Law (0.67)
Government (0.67)
Information Technology > Services (0.67)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
(5 more...)

Add feedback

NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages

Cahyawijaya, Samuel, Aji, Alham Fikri, Lovenia, Holy, Winata, Genta Indra, Wilie, Bryan, Mahendra, Rahmad, Koto, Fajri, Moeljadi, David, Vincentio, Karissa, Romadhony, Ade, Purwarianti, Ayu

arXiv.org Artificial IntelligenceAug-1-2022

At the center of the underlying issues that halt Indonesian natural language processing (NLP) research advancement, we find data scarcity. Resources in Indonesian languages, especially the local ones, are extremely scarce and underrepresented. Many Indonesian researchers do not publish their dataset. Furthermore, the few public datasets that we have are scattered across different platforms, thus makes performing reproducible and data-centric research in Indonesian NLP even more arduous. Rising to this challenge, we initiate the first Indonesian NLP crowdsourcing effort, NusaCrowd. NusaCrowd strives to provide the largest datasheets aggregation with standardized data loading for NLP tasks in all Indonesian languages. By enabling open and centralized access to Indonesian NLP resources, we hope NusaCrowd can tackle the data scarcity problem hindering NLP progress in Indonesia and bring NLP practitioners to move towards collaboration.

contributor, dataset, nusacrowd, (12 more...)

arXiv.org Artificial Intelligence

2207.10524

Country:

North America > United States (0.05)
North America > Dominican Republic (0.04)
North America > Canada (0.04)
(5 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Communications > Social Media > Crowdsourcing (0.34)

Add feedback

Language Models are Few-shot Multilingual Learners

Winata, Genta Indra, Madotto, Andrea, Lin, Zhaojiang, Liu, Rosanne, Yosinski, Jason, Fung, Pascale

arXiv.org Artificial IntelligenceSep-15-2021

General-purpose language models have demonstrated impressive capabilities, performing on par with state-of-the-art approaches on a range of downstream natural language processing (NLP) tasks and benchmarks when inferring instructions from very few examples. Here, we evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages without any parameter updates. We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones. Finally, we find the in-context few-shot cross-lingual prediction results of language models are significantly better than random prediction, and they are competitive compared to the existing state-of-the-art cross-lingual models.

dataset, preprint arxiv, proceedings, (13 more...)

arXiv.org Artificial Intelligence

2109.07684

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.74)

Add feedback

CAiRE in DialDoc21: Data Augmentation for Information-Seeking Dialogue System

Ishii, Etsuko, Xu, Yan, Winata, Genta Indra, Lin, Zhaojiang, Madotto, Andrea, Liu, Zihan, Xu, Peng, Fung, Pascale

arXiv.org Artificial IntelligenceJun-7-2021

Information-seeking dialogue systems, including knowledge identification and response generation, aim to respond to users with fluent, coherent, and informative responses based on users' needs, which. To tackle this challenge, we utilize data augmentation methods and several training techniques with the pre-trained language models to learn a general pattern of the task and thus achieve promising performance. In DialDoc21 competition, our system achieved 74.95 F1 score and 60.74 Exact Match score in subtask 1, and 37.72 SacreBLEU score in subtask 2. Empirical analysis is provided to explain the effectiveness of our approaches.

dataset, genta indra winata, proceedings, (13 more...)

arXiv.org Artificial Intelligence

2106.0353

Country:

Asia > China > Hong Kong (0.05)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.49)

Add feedback