Adhista, Dea
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Putri, Rifki Afina, Haznitrama, Faiz Ghifari, Adhista, Dea, Oh, Alice
Large Language Models (LLMs) are increasingly being used to generate synthetic data for training and evaluating models. However, it is unclear whether they can generate question answering (QA) datasets of good quality that incorporate the knowledge and cultural nuance embedded in a language, especially for low-resource languages. In this study, we investigate the effectiveness of using LLMs to generate culturally relevant commonsense QA datasets for Indonesian and Sundanese. To do so, we create datasets for these languages using various methods involving both LLMs and human annotators, resulting in ~4.5K questions per language (~9K in total), making our dataset the largest of its kind. Our experiments show that automatic data adaptation from an existing English dataset is less effective for Sundanese. Interestingly, using the direct generation method in the target language, GPT-4 Turbo can generate questions with adequate general knowledge in both languages, albeit not as culturally 'deep' as those written by humans. We also observe a higher occurrence of fluency errors in the Sundanese dataset, highlighting the discrepancy between medium- and lower-resource languages.
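The "direct generation" setup mentioned in the abstract amounts to prompting the LLM to write a question directly in the target language rather than adapting an English item. The sketch below illustrates that idea with the OpenAI chat API and GPT-4 Turbo; the prompt wording and output format are illustrative assumptions, not the prompt used in the paper.

```python
# Minimal sketch of direct generation in the target language (Indonesian).
# The prompt below is a hypothetical example, not the paper's actual prompt.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "Tulis satu pertanyaan pilihan ganda tentang pengetahuan umum dan budaya "
    "Indonesia, dengan lima pilihan jawaban (A-E), lalu tandai jawaban yang benar."
)  # "Write one multiple-choice commonsense question about Indonesian general
   # knowledge and culture, with five options (A-E), and mark the correct answer."

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```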
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages
Cahyawijaya, Samuel, Lovenia, Holy, Koto, Fajri, Adhista, Dea, Dave, Emmanuel, Oktavianti, Sarah, Akbar, Salsabil Maulana, Lee, Jhonson, Shadieq, Nuur, Cenggoro, Tjeng Wawan, Linuwih, Hanung Wahyuning, Wilie, Bryan, Muridan, Galih Pradipta, Winata, Genta Indra, Moeljadi, David, Aji, Alham Fikri, Purwarianti, Ayu, Fung, Pascale
Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resulting corpora, including a lack of lexical diversity and cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we present the NusaWrites benchmark, encompassing 12 underrepresented and extremely low-resource languages spoken by millions of individuals in Indonesia. Our empirical experiments using existing multilingual large language models demonstrate the need to extend these models to more underrepresented languages. We release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes.
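One of the comparisons the abstract describes is lexical diversity across corpora built by scraping, translation, and native-speaker writing. The sketch below shows one generic way such a comparison could be made, using a simple type-token ratio; this is an illustrative measure and whitespace tokenizer, not necessarily the metric or preprocessing used in the paper, and the example paragraphs are placeholders.

```python
# Illustrative lexical-diversity comparison via type-token ratio (TTR):
# unique word types divided by total tokens, computed per corpus.
from collections import Counter

def type_token_ratio(texts: list[str]) -> float:
    """Return unique-token count divided by total token count over all texts."""
    tokens = [tok.lower() for text in texts for tok in text.split()]
    if not tokens:
        return 0.0
    return len(Counter(tokens)) / len(tokens)

# Hypothetical usage: compare native-speaker-written paragraphs against
# scraped text for the same language (placeholder strings shown here).
written = ["Contoh paragraf yang ditulis oleh penutur asli ..."]
scraped = ["Contoh teks hasil penelusuran daring ..."]
print("written TTR:", type_token_ratio(written))
print("scraped TTR:", type_token_ratio(scraped))
```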