Information Extraction
MEMD-ABSA: A Multi-Element Multi-Domain Dataset for Aspect-Based Sentiment Analysis
Cai, Hongjie, Song, Nan, Wang, Zengzhi, Xie, Qiming, Zhao, Qiankun, Li, Ke, Wu, Siwei, Liu, Shijie, Yu, Jianfei, Xia, Rui
Aspect-based sentiment analysis is a long-standing research interest in the field of opinion mining, and in recent years, researchers have gradually shifted their focus from simple ABSA subtasks to end-to-end multi-element ABSA tasks. However, the datasets currently used in the research are limited to individual elements of specific tasks, usually focusing on in-domain settings, ignoring implicit aspects and opinions, and with a small data scale. To address these issues, we propose a large-scale Multi-Element Multi-Domain dataset (MEMD) that covers the four elements across five domains, including nearly 20,000 review sentences and 30,000 quadruples annotated with explicit and implicit aspects and opinions for ABSA research. Meanwhile, we evaluate generative and non-generative baselines on multiple ABSA subtasks under the open domain setting, and the results show that open domain ABSA as well as mining implicit aspects and opinions remain ongoing challenges to be addressed. The datasets are publicly released at \url{https://github.com/NUSTM/MEMD-ABSA}.
Emotion Analysis of Tweets Banning Education in Afghanistan
Hussiny, Mohammad Ali, Øvrelid, Lilja
This paper introduces the first emotion annotated dataset for the Dari variant of Persian spoken in Afghanistan. The LetHerLearn dataset contains 7,600 tweets posted in reaction to the Taliban ban of women rights to education in 2022 and has been manually annotated according to Ekman emotion categories. We here detail the data collection and annotation process, present relevant dataset statistics as well as initial experiments on the resulting dataset, benchmarking a number of different neural architectures for the task of Dari emotion classification.
What Sentiment and Fun Facts We Learnt Before FIFA World Cup Qatar 2022 Using Twitter and AI
She, James, Swart-Arries, Kamilla, Belal, Mohammad, Wong, Simon
Twitter is a social media platform bridging most countries and allows real-time news discovery. Since the tweets on Twitter are usually short and express public feelings, thus provide a source for opinion mining and sentiment analysis for global events. This paper proposed an effective solution, in providing a sentiment on tweets related to the FIFA World Cup. At least 130k tweets, as the first in the community, are collected and implemented as a dataset to evaluate the performance of the proposed machine learning solution. These tweets are collected with the related hashtags and keywords of the Qatar World Cup 2022. The Vader algorithm is used in this paper for sentiment analysis. Through the machine learning method and collected Twitter tweets, we discovered the sentiments and fun facts of several aspects important to the period before the World Cup. The result shows people are positive to the opening of the World Cup.
ConKI: Contrastive Knowledge Injection for Multimodal Sentiment Analysis
Yu, Yakun, Zhao, Mingjun, Qi, Shi-ang, Sun, Feiran, Wang, Baoxun, Guo, Weidong, Wang, Xiaoli, Yang, Lei, Niu, Di
Multimodal Sentiment Analysis leverages multimodal signals to detect the sentiment of a speaker. Previous approaches concentrate on performing multimodal fusion and representation learning based on general knowledge obtained from pretrained models, which neglects the effect of domain-specific knowledge. In this paper, we propose Contrastive Knowledge Injection (ConKI) for multimodal sentiment analysis, where specific-knowledge representations for each modality can be learned together with general knowledge representations via knowledge injection based on an adapter architecture. In addition, ConKI uses a hierarchical contrastive learning procedure performed between knowledge types within every single modality, across modalities within each sample, and across samples to facilitate the effective learning of the proposed representations, hence improving multimodal sentiment predictions. The experiments on three popular multimodal sentiment analysis benchmarks show that ConKI outperforms all prior methods on a variety of performance metrics.
Unleashing the Power of User Reviews: Exploring Airline Choices at Catania Airport, Italy
Miracula, Vincenzo, Picone, Antonio
This study aims to investigate the possible relationship between the mechanisms of social influence and the choice of airline, through the use of new tools, with the aim of understanding whether they can contribute to a better understanding of the factors influencing the decisions of consumers in the aviation sector. We have chosen to extract user reviews from well-known platforms: Trustpilot, Google, and Twitter. By combining web scraping techniques, we have been able to collect a comprehensive dataset comprising a wide range of user opinions, feedback, and ratings. We then refined the BERT model to focus on insightful sentiment in the context of airline reviews. Through our analysis, we observed an intriguing trend of average negative sentiment scores across various airlines, giving us deeper insight into the dynamics between airlines and helping us identify key partnerships, popular routes, and airlines that play a central role in the aeronautical ecosystem of Catania airport during the specified period. Our investigation led us to find that, despite an airline having received prestigious awards as a low-cost leader in Europe for two consecutive years 2021 and 2022, the "Catanese" user tends to suffer the dominant position of other companies. Understanding the impact of positive reviews and leveraging sentiment analysis can help airlines improve their reputation, attract more customers, and ultimately gain a competitive edge in the marketplace.
L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models
Pingle, Aabha, Vyawahare, Aditya, Joshi, Isha, Tangsali, Rahul, Joshi, Raviraj
The exploration of sentiment analysis in low-resource languages, such as Marathi, has been limited due to the availability of suitable datasets. In this work, we present L3Cube-MahaSent-MD, a multi-domain Marathi sentiment analysis dataset, with four different domains - movie reviews, general tweets, TV show subtitles, and political tweets. The dataset consists of around 60,000 manually tagged samples covering 3 distinct sentiments - positive, negative, and neutral. We create a sub-dataset for each domain comprising 15k samples. The MahaSent-MD is the first comprehensive multi-domain sentiment analysis dataset within the Indic sentiment landscape. We fine-tune different monolingual and multilingual BERT models on these datasets and report the best accuracy with the MahaBERT model. We also present an extensive in-domain and cross-domain analysis thus highlighting the need for low-resource multi-domain datasets. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .
Resume Information Extraction via Post-OCR Text Processing
Helli, Selahattin Serdar, Tanberk, Senem, Cavsak, Sena Nur
Information extraction (IE), one of the main tasks of natural language processing (NLP), has recently increased importance in the use of resumes. In studies on the text to extract information from the CV, sentence classification was generally made using NLP models. In this study, it is aimed to extract information by classifying all of the text groups after pre-processing such as Optical Character Recognition (OCT) and object recognition with the YOLOv8 model of the resumes. The text dataset consists of 286 resumes collected for 5 different (education, experience, talent, personal and language) job descriptions in the IT industry. The dataset created for object recognition consists of 1198 resumes, which were collected from the open-source internet and labeled as sets of text. BERT, BERT-t, DistilBERT, RoBERTa and XLNet were used as models. F1 score variances were used to compare the model results. In addition, the YOLOv8 model has also been reported comparatively in itself. As a result of the comparison, DistilBERT was showed better results despite having a lower number of parameters than other models.
Stress Testing BERT Anaphora Resolution Models for Reaction Extraction in Chemical Patents
Yueh, Chieling, Kanoulas, Evangelos, Martins, Bruno, Thorne, Camilo, Akhondi, Saber
The high volume of published chemical patents and the importance of a timely acquisition of their information gives rise to automating information extraction from chemical patents. Anaphora resolution is an important component of comprehensive information extraction, and is critical for extracting reactions. In chemical patents, there are five anaphoric relations of interest: co-reference, transformed, reaction associated, work up, and contained. Our goal is to investigate how the performance of anaphora resolution models for reaction texts in chemical patents differs in a noise-free and noisy environment and to what extent we can improve the robustness against noise of the model.
Constructing Colloquial Dataset for Persian Sentiment Analysis of Social Microblogs
Mazoochi, Mojtaba, Rabiei, Leyla, Rahmani, Farzaneh, Rajabi, Zeinab
Introduction: Microblogging websites have massed rich data sources for sentiment analysis and opinion mining. In this regard, sentiment classification has frequently proven inefficient because microblog posts typically lack syntactically consistent terms and representatives since users on these social networks do not like to write lengthy statements. Also, there are some limitations to low-resource languages. The Persian language has exceptional characteristics and demands unique annotated data and models for the sentiment analysis task, which are distinctive from text features within the English dialect. Method: This paper first constructs a user opinion dataset called ITRC-Opinion by collaborative environment and insource way. Our dataset contains 60,000 informal and colloquial Persian texts from social microblogs such as Twitter and Instagram. Second, this study proposes a new deep convolutional neural network (CNN) model for more effective sentiment analysis of colloquial text in social microblog posts. The constructed datasets are used to evaluate the presented model. Furthermore, some models, such as LSTM, CNN-RNN, BiLSTM, and BiGRU with different word embeddings, including Fasttext, Glove, and Word2vec, investigated our dataset and evaluated the results. Results: The results demonstrate the benefit of our dataset and the proposed model (72% accuracy), displaying meaningful improvement in sentiment classification performance.
Multi-aspect Multilingual and Cross-lingual Parliamentary Speech Analysis
Miok, Kristian, Hidalgo-Tenorio, Encarnacion, Osenova, Petya, Benitez-Castro, Miguel-Angel, Robnik-Sikonja, Marko
Parliamentary and legislative debate transcripts provide informative insight into elected politicians' opinions, positions, and policy preferences. They are interesting for political and social sciences as well as linguistics and natural language processing (NLP) research. While existing research studied individual parliaments, we apply advanced NLP methods to a joint and comparative analysis of six national parliaments (Bulgarian, Czech, French, Slovene, Spanish, and United Kingdom) between 2017 and 2020. We analyze emotions and sentiment in the transcripts from the ParlaMint dataset collection and assess if the age, gender, and political orientation of speakers can be detected from their speeches. The results show some commonalities and many surprising differences among the analyzed countries.