 sinhala word


Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources

Sumanathilaka, Deshan, Perera, Sameera, Dharmasiri, Sachithya, Athukorala, Maneesha, Herath, Anuja Dilrukshi, Dias, Rukshan, Gamage, Pasindu, Weerasinghe, Ruvan, Priyadarshana, Y. H. P. P.

arXiv.org Artificial Intelligence

The Swa-bhasha Resource Hub provides a comprehensive collection of data resources and algorithms developed for Romanized Sinhala to Sinhala transliteration between 2020 and 2025. These resources have played a significant role in advancing research in Sinhala Natural Language Processing (NLP), particularly in training transliteration models and developing applications involving Romanized Sinhala. The current openly accessible data sets and corresponding tools are made publicly available through this hub. This paper presents a detailed overview of the resources contributed by the authors and includes a comparative analysis of existing transliteration applications in the domain.


Linguistic Analysis of Sinhala YouTube Comments on Sinhala Music Videos: A Dataset Study

De Mel, W. M. Yomal, de Silva, Nisansa

arXiv.org Artificial Intelligence

This research investigates Music Information Retrieval (MIR) and Music Emotion Recognition (MER) in relation to Sinhala songs, an underexplored field in music studies. The purpose of this study is to analyze the behavior of Sinhala comments on YouTube videos of Sinhala songs, using social media comments as the primary data source. Comments were collected from 27 YouTube videos covering 20 different Sinhala songs, which were carefully selected to maintain strict linguistic reliability and ensure relevance. This process gathered a total of 93,116 comments, which were then refined by advanced filtering methods and transliteration mechanisms into a dataset of 63,471 Sinhala comments. Additionally, 964 stop-words specific to the Sinhala language were algorithmically derived, of which 182, once translated, matched English stop-words from the NLTK corpus exactly. Comparisons between general-domain Sinhala corpora and the Sinhala YouTube comment corpus confirmed the latter as a good representation of the general domain. The meticulously curated dataset and the derived stop-words form important resources for future research in MIR and MER, and they demonstrate that computational techniques can be used to study complex musical experiences across varied cultural traditions.


Swa Bhasha: Message-Based Singlish to Sinhala Transliteration

Athukorala, Maneesha U., Sumanathilaka, Deshan K.

arXiv.org Artificial Intelligence

Machine transliteration provides the ability to transliterate a basic language into different languages computationally. Transliteration is an important technical process that has recently attracted attention. Sinhala transliteration faces many constraints because of the scarcity of resources for the Sinhala language. Due to these limitations, Sinhala transliteration is highly complex and time-consuming. Therefore, the majority of Sri Lankans use an informal texting language called 'Singlish' to simplify the process. This study focuses on word-level transliteration of Singlish, reducing the complexity of transliteration. A new coding system was devised with a rule-based approach that can map to the matching Sinhala words even without vowels. Various typing patterns were collected from different communities for this purpose. The collected data were analyzed against every Sinhala character, and the unique Singlish patterns related to each were generated. The system introduces a new numeric coding system for Singlish letters, matched to the recognized typing patterns. The mapping process uses a fuzzy logic-based implementation. A codified dictionary including unique numeric values was also implemented. In this system, each Romanized English letter is assigned a unique numeric code, so that a unique pattern can be constructed for each word. The system identifies the most relevant Sinhala word matching the pattern of the Singlish word, or it gives the most closely related word suggestions. For example, the variants 'kiyanna', 'kianna', 'kynna', 'kynn', and 'kiynna' are all mapped to the correct Sinhala word "kiyanna". These results show that the 'Swa Bhasha' transliteration system can enhance Sinhala users' experience when texting from Singlish to Sinhala.
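
The idea of coding each Roman letter numerically, ignoring vowels, and matching the resulting pattern against a codified dictionary can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Swa Bhasha implementation: the alphabet-position codes, the three-entry `DICTIONARY`, the `cutoff` value, and the helper names `encode`, `rank`, and `transliterate` are all hypothetical, and the similarity ratio from Python's `difflib.SequenceMatcher` stands in for the fuzzy logic-based matching the abstract mentions.

```python
import string
from difflib import SequenceMatcher

VOWELS = set("aeiou")

# Hypothetical codified dictionary: canonical Romanized spelling -> Sinhala word.
# The entries are illustrative only and are not taken from the paper's data.
DICTIONARY = {
    "kiyanna": "කියන්න",
    "karanna": "කරන්න",
    "yanna": "යන්න",
}


def encode(word):
    """Turn a Romanized word into a tuple of numeric consonant codes.

    Assumption: a letter's code is its position in the alphabet (a=1 ... z=26).
    Vowels are skipped and consecutive repeated codes are collapsed, so
    'kiyanna' and 'kynn' produce the same pattern.
    """
    codes = []
    for ch in word.lower():
        if ch not in string.ascii_lowercase or ch in VOWELS:
            continue
        code = ord(ch) - ord("a") + 1
        if codes and codes[-1] == code:
            continue  # collapse doubled consonants such as 'nn'
        codes.append(code)
    return tuple(codes)


def similarity(a, b):
    """Similarity in [0, 1] between two sequences (strings or code tuples)."""
    return SequenceMatcher(None, a, b).ratio()


def rank(word):
    """Rank dictionary entries, best first.

    Entries are ordered by similarity of the numeric patterns; ties are
    broken by similarity of the raw spellings.
    """
    typed = word.lower()
    code = encode(typed)
    scored = [
        ((similarity(code, encode(entry)), similarity(typed, entry)), entry)
        for entry in DICTIONARY
    ]
    scored.sort(reverse=True)
    return scored


def transliterate(word, cutoff=0.6):
    """Return the best-matching Sinhala word, or None if nothing is close."""
    (code_score, _), entry = rank(word)[0]
    return DICTIONARY[entry] if code_score >= cutoff else None


if __name__ == "__main__":
    for variant in ["kiyanna", "kianna", "kynna", "kynn", "kiynna"]:
        print(variant, "->", transliterate(variant))
```

In this sketch, four of the five variants collapse to the same consonant pattern as the dictionary entry, while 'kianna' (which lacks the 'y') is recovered through the similarity score and the spelling tie-break. When no entry clears the cutoff, the ranked list from `rank` could serve as the source of word suggestions.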