Ptaszynski, Michal
Improving Polish to English Neural Machine Translation with Transfer Learning: Effects of Data Volume and Language Similarity
Eronen, Juuso, Ptaszynski, Michal, Nowakowski, Karol, Chia, Zheng Lin, Masui, Fumito
This paper investigates the impact of data volume and the use of similar languages on transfer learning in a machine translation task. We find that having more data generally leads to better performance, as it allows the model to learn more patterns and generalizations from the data. However, related languages can be particularly effective when only limited data is available for a specific language pair, as the model can leverage the similarities between the languages to improve performance. To demonstrate this, we fine-tune the mBART model for a Polish-English translation task using the OPUS-100 dataset. We evaluate the performance of the model under various transfer learning configurations, including different transfer source languages and different shot levels for Polish, and report the results. Our experiments show that a combination of related languages and larger amounts of data outperforms models trained on related languages or larger amounts of data alone. Additionally, we show the importance of related languages in zero-shot and few-shot configurations.
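A minimal sketch of the fine-tuning setup described above, assuming the Hugging Face Transformers library, the facebook/mbart-large-50 checkpoint and the opus100 dataset on the Hub; the checkpoint, data slice and hyperparameters are illustrative assumptions, not the paper's exact configuration.

import torch
from datasets import load_dataset
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

# Polish as the source language, English as the target.
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="pl_PL", tgt_lang="en_XX"
)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

# A "few-shot" Polish configuration: subsample the Polish-English pairs.
data = load_dataset("opus100", "en-pl", split="train[:1000]")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for example in data:
    pair = example["translation"]
    batch = tokenizer(pair["pl"], text_target=pair["en"],
                      return_tensors="pt", truncation=True, max_length=128)
    loss = model(**batch).loss  # cross-entropy over English target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()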
Zero-shot cross-lingual transfer language selection using linguistic similarity
Eronen, Juuso, Ptaszynski, Michal, Masui, Fumito
We study the selection of transfer languages for different Natural Language Processing tasks, specifically sentiment analysis, named entity recognition and dependency parsing. In order to select an optimal transfer language, we propose utilizing different linguistic similarity metrics to measure the distance between languages, and basing the choice of transfer language on this information instead of relying on intuition. We demonstrate that linguistic similarity correlates with cross-lingual transfer performance for all of the proposed tasks. We also show that there is a statistically significant difference between choosing the optimal language as the transfer source and defaulting to English. This allows us to select a more suitable transfer language that better leverages knowledge from high-resource languages in order to improve the performance of language applications lacking data. For the study, we used datasets covering eight languages from three language families.
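As an illustration of distance-based transfer language selection, the sketch below ranks candidate source languages by their distance to a target language, assuming the lang2vec package built on the URIEL database; the target, candidate set and choice of syntactic distance are illustrative, not the paper's exact metrics.

import lang2vec.lang2vec as l2v

target = "fin"  # ISO 639-3 code of a (hypothetical) low-resource target
candidates = ["eng", "est", "hun", "swe", "rus"]

# Smaller distance = more similar language; pick the closest candidate
# instead of defaulting to English.
ranked = sorted(candidates,
                key=lambda lang: l2v.distance("syntactic", lang, target))
print("best transfer source:", ranked[0])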
Adapting Multilingual Speech Representation Model for a New, Underresourced Language through Multilingual Fine-tuning and Continued Pretraining
Nowakowski, Karol, Ptaszynski, Michal, Murasaki, Kyoko, Nieuważny, Jagna
In recent years, neural models learned through self-supervised pretraining on large-scale multilingual text or speech data have exhibited promising results for underresourced languages, especially when a relatively large amount of data from related language(s) is available. While the technology has the potential to facilitate tasks carried out in language documentation projects, such as speech transcription, pretraining a multilingual model from scratch for every new language would be highly impractical. We investigate the possibility of adapting an existing multilingual wav2vec 2.0 model to a new language, focusing on actual fieldwork data from a critically endangered tongue: Ainu. Specifically, we (i) examine the feasibility of leveraging data from similar languages also in fine-tuning; (ii) verify whether the model's performance can be improved by further pretraining on target language data. Our results show that continued pretraining is the most effective method of adapting a wav2vec 2.0 model to a new language, leading to considerable reductions in error rates. Furthermore, we find that if a model pretrained on a related speech variety, or on an unrelated language with similar phonological characteristics, is available, multilingual fine-tuning using additional data from that language can have a positive impact on speech recognition performance when there is very little labeled data in the target language.
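The sketch below illustrates the multilingual fine-tuning setting in its simplest form, assuming the Hugging Face facebook/wav2vec2-large-xlsr-53 checkpoint: a randomly initialised CTC head is added to the pretrained multilingual encoder and trained on pooled batches of target-language and related-language speech. The vocabulary size and the dummy batch are placeholders; the paper's actual recipe, including the continued-pretraining stage, is more involved.

import torch
from transformers import Wav2Vec2ForCTC

# XLSR-53 is pretrained without a CTC head, so a new head is initialised
# for a (hypothetical) 40-character target vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=40,
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # keep the convolutional front-end fixed
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stand-in for one pooled batch mixing target-language and related-language
# utterances: raw 16 kHz waveforms plus character-label sequences.
batch = {"input_values": torch.randn(2, 16000),
         "labels": torch.randint(0, 40, (2, 12))}
loss = model(**batch).loss  # CTC loss over the pooled batch
loss.backward()
optimizer.step()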
Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density
Eronen, Juuso, Ptaszynski, Michal, Masui, Fumito, Smywiński-Pohl, Aleksander, Leliwa, Gniewosz, Wroczynski, Michal
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods in order to estimate dataset complexity, which in turn is used to comparatively estimate the potential performance of machine learning (ML) classifiers prior to any training. We hypothesise that estimating dataset complexity allows for reducing the number of required experiment iterations. This way we can optimize the resource-intensive training of ML models, which is becoming a serious issue due to increases in available dataset sizes and the ever-rising popularity of models based on Deep Neural Networks (DNN). The constantly increasing need for more powerful computational resources also affects the environment, due to the alarmingly growing amount of CO2 emissions caused by training large-scale ML models. The research was conducted on multiple datasets, including popular ones such as the Yelp business review dataset, used for training typical sentiment analysis models, as well as more recent datasets addressing the problem of cyberbullying, which, being a serious social problem, is also a much more sophisticated problem from the point of view of linguistic representation. We use cyberbullying datasets collected for multiple languages, namely English, Japanese and Polish. The difference in linguistic complexity between the datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
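As a rough illustration, and assuming Feature Density is computed as the ratio of unique features to all feature occurrences in a dataset (the paper additionally compares several linguistically-backed preprocessing variants, such as lemmas and parts of speech), a minimal sketch over word unigrams could look as follows.

from sklearn.feature_extraction.text import CountVectorizer

def feature_density(corpus, ngram_range=(1, 1)):
    counts = CountVectorizer(ngram_range=ngram_range).fit_transform(corpus)
    unique_features = counts.shape[1]  # vocabulary size
    all_features = counts.sum()        # total feature occurrences
    return unique_features / all_features

docs = ["you are such a loser", "great food and friendly staff"]
print(feature_density(docs))  # higher FD suggests sparser, more complex data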
CAO: A Fully Automatic Emoticon Analysis System
Ptaszynski, Michal, Maciejewski, Jacek, Dybala, Pawel, Rzepka, Rafal, Araki, Kenji (Hokkaido University)
This paper presents CAO, a system for affect analysis of emoticons. Emoticons are strings of symbols widely used in text-based online communication to convey emotions. The system extracts emoticons from input text and determines the specific emotions they express. First, it matches the extracted emoticons against a raw emoticon database containing over ten thousand emoticon samples extracted from the Web and annotated automatically. Emoticons for which the emotion types cannot be determined using this database alone are automatically divided into semantic areas representing "mouths" or "eyes," based on the theory of kinesics. The areas are automatically annotated according to their co-occurrence in the database. The annotation is based first on the eye-mouth-eye triplet; if no such triplet is found, all semantic areas are estimated separately. This gives the system a coverage exceeding 3 million possibilities. The evaluation, performed on both training and test sets, confirmed the system's ability to detect and extract any emoticon, analyze its semantic structure and estimate the potential emotion types expressed. The system achieved nearly ideal scores, outperforming existing emoticon analysis systems.
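A toy sketch of the two-stage analysis described above; the miniature database and eye annotations below are illustrative stand-ins for CAO's ten-thousand-sample emoticon database and its kinesics-based co-occurrence annotations.

import re

EMOTICON_DB = {"(^_^)": "joy", "(T_T)": "sadness"}          # raw-emoticon stage
EYE_EMOTION = {"^": "joy", "T": "sadness", ";": "sadness"}  # semantic-area stage

def analyze(emoticon):
    # Stage 1: exact match against the raw emoticon database.
    if emoticon in EMOTICON_DB:
        return EMOTICON_DB[emoticon]
    # Stage 2: fall back to the eye-mouth-eye triplet, dividing the string
    # into "eyes" and a "mouth" and estimating emotion from area annotations.
    triplet = re.fullmatch(r"\(?(\S)(\S)\1\)?", emoticon)
    if triplet:
        return EYE_EMOTION.get(triplet.group(1), "unknown")
    return "unknown"

print(analyze("(^_^)"))  # database hit -> joy
print(analyze("(;_;)"))  # triplet fallback -> sadness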
A Pragmatic Approach to Implementation of Emotional Intelligence in Machines
Ptaszynski, Michal, Rzepka, Rafal, Araki, Kenji (Hokkaido University)
With this paper we would like to open a discussion on the need for Emotional Intelligence as a feature in machines interacting with humans. However, we refrain from making a statement about the need for emotional experience in machines. We argue that providing machines with computable means for processing emotions is a practical need that requires implementing a set of abilities included in the Emotional Intelligence Framework. We introduce our methods and present the results of some of the first experiments we performed in this matter.