
Collaborating Authors

Hettiarachchi, Hansi


Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)

arXiv.org Artificial Intelligence

The first Workshop on Language Models for Low-Resource Languages (LoResLM 2025) was held in conjunction with the 31st International Conference on Computational Linguistics (COLING 2025) in Abu Dhabi, United Arab Emirates. The workshop aimed to provide a forum for researchers to share and discuss ongoing work on language models (LMs) for low-resource languages, in light of recent advances in neural LMs and their linguistic bias towards high-resource languages. LoResLM 2025 attracted notable interest from the natural language processing (NLP) community, resulting in 35 accepted papers from 52 submissions. These contributions cover a broad range of low-resource languages from eight language families and 13 diverse research areas, paving the way for future possibilities and promoting linguistic inclusivity in NLP.


NSINA: A News Corpus for Sinhala

arXiv.org Artificial Intelligence

The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSina, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSina aims to address the challenges of adapting LLMs to Sinhala by offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSina is the largest news corpus available for Sinhala to date.
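
As a rough illustration of how the three benchmark tasks map onto the corpus, the following Python sketch derives each task from a hypothetical flat export of the articles; the file path and column names ("source", "category", "headline", "content") are assumptions for illustration, not the official schema.

```python
# Hypothetical sketch: deriving the three NSina benchmark tasks from a
# flat export of the corpus. Path and column names are assumptions.
import pandas as pd

articles = pd.read_json("nsina_articles.jsonl", lines=True)  # hypothetical path

# Task 1: news media identification -- predict the publishing site.
media_task = articles[["content", "source"]]

# Task 2: news category prediction -- predict the editorial category.
category_task = articles.dropna(subset=["category"])[["content", "category"]]

# Task 3: news headline generation -- generate the headline from the body.
headline_task = articles[["content", "headline"]]

print(f"{len(articles):,} articles from {articles['source'].nunique()} outlets")
```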


SOLD: Sinhala Offensive Language Dataset

arXiv.org Artificial Intelligence

The widespread presence of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic have dealt with English and a few other high-resource languages. As a result, research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple experiments on this dataset. SOLD is a manually annotated dataset containing 10,000 posts from Twitter, annotated as offensive or not offensive at both the sentence and token levels, improving the explainability of the ML models. SOLD is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.
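
To make the two annotation granularities concrete, here is a minimal sketch of what a SOLD-style record could look like; the field names are illustrative assumptions rather than the dataset's actual schema.

```python
# Illustrative record structure for a SOLD-style post: one sentence-level
# label plus per-token rationale flags. Field names are assumptions.
post = {
    "text": "...",                 # Sinhala tweet text, elided here
    "label": "OFF",                # sentence-level: OFF or NOT
    "tokens": ["w1", "w2", "w3"],  # placeholder tokens
    "rationales": [0, 1, 0],       # 1 marks tokens that make the post offensive
}

# Token-level flags let a model's attributions be checked against
# human-marked spans, which is what "explainability" refers to above.
offensive_tokens = [t for t, r in zip(post["tokens"], post["rationales"]) if r]
```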


Event Causality Identification with Causal News Corpus -- Shared Task 3, CASE 2022

arXiv.org Artificial Intelligence

The Event Causality Identification Shared Task of CASE 2022 involved two subtasks working on the Causal News Corpus. Subtask 1 required participants to predict whether a sentence contains a causal relation or not; this is a supervised binary classification task. Subtask 2 required participants to identify the Cause, Effect and Signal spans within each causal sentence; this can be seen as a supervised sequence labeling task. For both subtasks, participants uploaded their predictions for a held-out test set, and ranking was done based on binary F1 and macro F1 scores for Subtasks 1 and 2, respectively. This paper summarizes the work of the 17 teams that submitted results to our competition and the 12 system description papers received. The best F1 scores achieved for Subtasks 1 and 2 were 86.19% and 54.15%, respectively. All the top-performing approaches involved pre-trained language models fine-tuned to the targeted task. We further discuss these approaches and analyze errors across participants' systems in this paper.
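
A minimal sketch of the two subtask formats follows, assuming a BIO-style tagging convention for the spans (an illustration only; the shared task defines its own span-annotation format) and using scikit-learn for the Subtask 1 ranking metric.

```python
# Sketch of the two subtask formats under assumed label conventions.
from sklearn.metrics import f1_score

# Subtask 1: binary causal-sentence classification, ranked by binary F1.
y_true = [1, 0, 1, 1]  # toy gold labels (1 = causal sentence)
y_pred = [1, 0, 0, 1]  # toy system predictions
print(f1_score(y_true, y_pred, average="binary"))

# Subtask 2: Cause/Effect/Signal spans as token-level sequence labeling,
# written here with illustrative BIO tags (C = Cause, S = Signal, E = Effect).
tokens = ["The", "storm", "caused", "flooding", "."]
tags   = ["B-C", "I-C",   "B-S",    "B-E",      "O"]
```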


Extended Multilingual Protest News Detection -- Shared Task 1, CASE 2021 and 2022

arXiv.org Artificial Intelligence

We report results of the CASE 2022 Shared Task 1 on Multilingual Protest Event Detection. This task is a continuation of CASE 2021 and consists of four subtasks: i) document classification, ii) sentence classification, iii) event sentence coreference identification, and iv) event extraction. The CASE 2022 extension expands the test data with more data in previously available languages, namely English, Hindi, Portuguese, and Spanish, and adds new test data in Mandarin, Turkish, and Urdu for Subtask 1, document classification. The training data from CASE 2021 in English, Portuguese and Spanish were utilized; therefore, predicting document labels in Hindi, Mandarin, Turkish, and Urdu occurs in a zero-shot setting. The CASE 2022 workshop also accepted reports on systems developed for predicting the CASE 2021 test data. We observe that the best systems submitted by CASE 2022 participants achieve between 79.71 and 84.06 F1-macro for new languages in a zero-shot setting. The winning approaches mainly ensemble models and merge data in multiple languages. The best two submissions on CASE 2021 data outperform the previous year's submissions for Subtask 1 and Subtask 2 in all languages. The only scenarios not outperformed by new submissions on CASE 2021 were Subtask 3 Portuguese & Subtask 4 English.
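
The zero-shot setting described above can be sketched as follows: a multilingual encoder fine-tuned only on the available training languages is applied unchanged to documents in languages it never saw during fine-tuning. The checkpoint path below is hypothetical, and this is a sketch of the setting, not any participant's system.

```python
# Minimal sketch of zero-shot cross-lingual document classification,
# assuming an XLM-R model fine-tuned on the English/Portuguese/Spanish
# CASE 2021 training documents. The checkpoint path is hypothetical.
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="./xlmr-protest-doc",  # hypothetical fine-tuned checkpoint
)

# A Hindi document is scored without any Hindi training examples:
# the multilingual encoder carries the transfer.
print(clf("<hindi news article text>"))
```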


TransWiC at SemEval-2021 Task 2: Transformer-based Multilingual and Cross-lingual Word-in-Context Disambiguation

arXiv.org Artificial Intelligence

Identifying whether a word carries the same meaning or a different meaning in two contexts is an important research area in natural language processing, playing a significant role in applications such as question answering, document summarisation, information retrieval and information extraction. Most previous work in this area relies on language-specific resources, making it difficult to generalise across languages. Considering this limitation, our approach to SemEval-2021 Task 2 is based only on pretrained transformer models and does not use any language-specific processing or resources. Despite this, our best model achieves 0.90 accuracy on the English-English subtask, which is highly competitive with the subtask's best result of 0.93 accuracy. Our approach also achieves satisfactory results on the other monolingual and cross-lingual language pairs.
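
A language-agnostic way to approach Word-in-Context with nothing but a pretrained transformer is to compare the contextual vectors of the target word in the two sentences. The sketch below uses raw cosine similarity with an assumed threshold; the actual TransWiC system fine-tunes on sentence pairs instead, so this illustrates the idea rather than the submitted model.

```python
# Sketch: compare contextual embeddings of the same word in two sentences.
# Model choice and the 0.75 threshold are assumptions for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden)
    # Average the subword vectors belonging to the target word.
    word_ids = set(tok(word, add_special_tokens=False)["input_ids"])
    positions = [i for i, t in enumerate(enc["input_ids"][0].tolist())
                 if t in word_ids]
    return hidden[positions].mean(dim=0)

v1 = word_vector("He sat on the bank of the river.", "bank")
v2 = word_vector("She deposited cash at the bank.", "bank")
similarity = torch.cosine_similarity(v1, v2, dim=0)
print("same meaning" if similarity > 0.75 else "different meaning")
```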


BRUMS at SemEval-2020 Task 12 : Transformer based Multilingual Offensive Language Identification in Social Media

arXiv.org Artificial Intelligence

In this paper, we describe team BRUMS' entry to OffensEval 2: Multilingual Offensive Language Identification in Social Media at SemEval-2020. The OffensEval organizers provided participants with annotated datasets containing posts from social media in Arabic, Danish, English, Greek and Turkish. We present a multilingual deep learning model to identify offensive language in social media. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility between languages.


BRUMS at SemEval-2020 Task 3: Contextualised Embeddings for Predicting the (Graded) Effect of Context in Word Similarity

arXiv.org Artificial Intelligence

This paper presents the team BRUMS submission to SemEval-2020 Task 3: Graded Word Similarity in Context. The system utilises state-of-the-art contextualised word embeddings with several task-specific adaptations, including stacked embeddings and average embeddings. Overall, the approach achieves good evaluation scores across all the languages while maintaining simplicity. In the final rankings, our approach placed within the top 5 solutions for each language, including 1st place on Finnish Subtask 2.
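
The two adaptations named above can be sketched with any BERT-style encoder that exposes its hidden states: "stacked" concatenates the vectors from several layers, while "average" takes their element-wise mean. The model choice and the last-four-layers window below are assumptions for illustration, not the exact submitted configuration.

```python
# Sketch of stacked vs. average contextual embeddings, assuming a
# multilingual BERT encoder and a last-four-layers window.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)

enc = tok("The mole dug under the garden.", return_tensors="pt")
with torch.no_grad():
    layers = model(**enc).hidden_states  # tuple: embedding layer + 12 layers

last4 = torch.stack(layers[-4:])            # (4, 1, seq_len, 768)
stacked = torch.cat(layers[-4:], dim=-1)    # (1, seq_len, 3072): concatenated
averaged = last4.mean(dim=0)                # (1, seq_len, 768): element-wise mean
```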