lexicon expansion
TriLex: A Framework for Multilingual Sentiment Analysis in Low-Resource South African Languages
Nkongolo, Mike, Vorster, Hilton, Warren, Josh, Naick, Trevor, Vanmali, Deandre, Mashapha, Masana, Brand, Luke, Fernandes, Alyssa, Calitz, Janco, Makhoba, Sibusiso
Low-resource African languages remain underrepresented in sentiment analysis research, resulting in limited lexical resources and reduced model performance in multilingual applications. This gap restricts equitable access to Natural Language Processing (NLP) technologies and hinders downstream tasks such as public-health monitoring, digital governance, and financial inclusion. To address this challenge, this paper introduces TriLex, a three-stage retrieval-augmented framework that integrates corpus-based extraction, cross-lingual mapping, and Retrieval-Augmented Generation (RAG) driven lexicon refinement for scalable sentiment lexicon expansion in low-resource languages. Using an expanded lexicon, we evaluate two leading African language models (AfroXLMR and AfriBERTa) across multiple case studies. Results show that AfroXLMR consistently achieves the strongest performance, with F1-scores exceeding 80% for isiXhosa and isiZulu, aligning with previously reported ranges (71-75%), and demonstrating high multilingual stability with narrow confidence intervals. AfriBERTa, despite lacking pre-training on the target languages, attains moderate but reliable F1-scores around 64%, confirming its effectiveness under constrained computational settings. Comparative analysis shows that both models outperform traditional machine learning baselines, while ensemble evaluation combining AfroXLMR variants indicates complementary improvements in precision and overall stability. These findings confirm that the TriLex framework, together with AfroXLMR and AfriBERTa, provides a robust and scalable approach for sentiment lexicon development and multilingual sentiment analysis in low-resource South African languages.
DualCoTs: Dual Chain-of-Thoughts Prompting for Sentiment Lexicon Expansion of Idioms
Niu, Fuqiang, Tan, Minghuan, Zhang, Bowen, Yang, Min, Xu, Ruifeng
Idioms represent a ubiquitous vehicle for conveying sentiments in the realm of everyday discourse, rendering the nuanced analysis of idiom sentiment crucial for a comprehensive understanding of emotional expression within real-world texts. Nevertheless, the existing corpora dedicated to idiom sentiment analysis considerably limit research in text sentiment analysis. In this paper, we propose an innovative approach to automatically expand the sentiment lexicon for idioms, leveraging the capabilities of large language models through the application of Chain-of-Thought prompting. To demonstrate the effectiveness of this approach, we integrate multiple existing resources and construct an emotional idiom lexicon expansion dataset (called EmoIdiomE), which encompasses a comprehensive repository of Chinese and English idioms. We then design the Dual Chain-of-Thoughts (DualCoTs) method, which combines insights from linguistics and psycholinguistics, to show how large models can be used to automatically expand the sentiment lexicon for idioms. Experiments show that DualCoTs is effective in idiom sentiment lexicon expansion in both Chinese and English. For reproducibility, we will release the data and code upon acceptance.
LEXpander: applying colexification networks to automated lexicon expansion
Di Natale, Anna, Garcia, David
Recent approaches to text analysis from social media and other corpora rely on word lists to detect topics, measure meaning, or select relevant documents. These lists are often generated by applying computational lexicon expansion methods to small, manually curated sets of root words. Despite the wide use of this approach, we still lack an exhaustive comparative analysis of the performance of lexicon expansion methods and how they can be improved with additional linguistic data. In this work, we present LEXpander, a method for lexicon expansion that leverages novel data on colexification, i.e., semantic networks connecting words based on shared concepts and translations to other languages. We evaluate LEXpander in a benchmark including widely used methods for lexicon expansion based on various word embedding models and synonym networks. We find that LEXpander outperforms existing approaches in terms of both precision and the trade-off between precision and recall of generated word lists in a variety of tests. Our benchmark includes several linguistic categories and sentiment variables in English and German. We also show that the expanded word lists constitute a high-performing text analysis method when applied to various corpora. This way, LEXpander offers a systematic automated solution to expand short lists of words into exhaustive and accurate word lists that can closely approximate word lists generated by experts in psychology and linguistics.
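The general idea of expanding a seed list through a word network, and of scoring the result by precision and recall against an expert list, can be illustrated with a minimal sketch. It is not the LEXpander algorithm: the one-hop neighbor rule, the function names, and the toy edges are all assumptions made for illustration, and real colexification data would replace the invented edge list.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected adjacency map from (word, word) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def expand(seeds, graph):
    """Return the seeds plus every word directly linked to a seed.

    One-hop expansion is an assumption of this sketch, not a rule
    taken from the abstract.
    """
    expanded = set(seeds)
    for seed in seeds:
        expanded |= graph.get(seed, set())
    return expanded


def precision_recall(generated, reference):
    """Score a generated word list against a reference (expert) list."""
    hits = len(generated & reference)
    precision = hits / len(generated) if generated else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Toy, invented edges standing in for colexification links.
edges = [("happy", "glad"), ("happy", "lucky"),
         ("glad", "joyful"), ("sad", "sorry")]
graph = build_network(edges)

expanded = expand({"happy"}, graph)
# Hypothetical expert list used only to exercise the metric.
precision, recall = precision_recall(expanded, {"happy", "glad", "joyful"})
```

In this toy case the expansion adds "glad" (a hit) and "lucky" (a miss), which shows how a single noisy edge lowers precision while a missing edge ("joyful" is two hops away) lowers recall.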
Chinese LIWC Lexicon Expansion via Hierarchical Classification of Word Embeddings with Sememe Attention
Zeng, Xiangkai (Beihang University) | Yang, Cheng (Tsinghua University) | Tu, Cunchao (Tsinghua University) | Liu, Zhiyuan (Tsinghua University) | Sun, Maosong (Tsinghua University)
Linguistic Inquiry and Word Count (LIWC) is a word-counting software tool that has been used for quantitative text analysis in many fields. Due to its success and popularity, the core lexicon has been translated into Chinese and many other languages. However, the lexicon contains only several thousand words, far fewer than the number of common words in Chinese. Current approaches often require expanding the lexicon manually, which is time-consuming and depends on linguistic experts. To address this issue, we propose to expand the LIWC lexicon automatically. Specifically, we treat expansion as a hierarchical classification problem and use a Sequence-to-Sequence model to classify words in the lexicon. Moreover, we use sememe information with an attention mechanism to capture the exact meanings of a word, so that the expanded lexicon is more precise and comprehensive. The experimental results show that our model has a better understanding of word meanings with the help of sememes and achieves significant and consistent improvements compared with the state-of-the-art methods. The source code of this paper can be obtained from https://github.com/thunlp/Auto_CLIWC.
Semi-supervised emotion lexicon expansion with label propagation and specialized word embeddings
There exist two main approaches to automatically extract affective orientation: lexicon-based and corpus-based. In this work, we argue that these two methods are compatible and show that combining them can improve the accuracy of emotion classifiers. In particular, we introduce a novel variant of the Label Propagation algorithm that is tailored to distributed word representations, we apply batch gradient descent to accelerate the optimization of label propagation and to make the optimization feasible for large graphs, and we propose a reproducible method for emotion lexicon expansion. We conclude that label propagation can expand an emotion lexicon in a meaningful way and that the expanded emotion lexicon can be leveraged to improve the accuracy of an emotion classifier.
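For readers unfamiliar with label propagation, the following minimal sketch shows the standard iterative form with clamped seeds on a cosine-similarity graph. It is not the novel variant or the batch gradient descent optimization described in the abstract; the function names, the two-dimensional toy vectors, and the label indices are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propagate(vectors, seed_labels, n_labels, iterations=50):
    """Spread seed labels to unlabeled words over a similarity graph."""
    words = list(vectors)
    # Edge weights: non-negative cosine similarity between distinct words.
    weights = {
        w: {o: max(cosine(vectors[w], vectors[o]), 0.0)
            for o in words if o != w}
        for w in words
    }
    # Seeds start one-hot; all other words start uniform.
    scores = {}
    for w in words:
        if w in seed_labels:
            scores[w] = [1.0 if k == seed_labels[w] else 0.0
                         for k in range(n_labels)]
        else:
            scores[w] = [1.0 / n_labels] * n_labels
    for _ in range(iterations):
        updated = {}
        for w in words:
            total = sum(weights[w].values())
            if w in seed_labels or total == 0:
                updated[w] = scores[w]  # seeds stay clamped
                continue
            # Weighted average of the neighbors' current label scores.
            updated[w] = [
                sum(weights[w][o] * scores[o][k] for o in weights[w]) / total
                for k in range(n_labels)
            ]
        scores = updated
    return scores


# Invented toy embeddings; label 0 = joy-like, label 1 = sadness-like.
vectors = {
    "joy": (1.0, 0.1), "delight": (0.9, 0.2),
    "grief": (0.1, 1.0), "sorrow": (0.2, 0.9),
}
scores = propagate(vectors, {"joy": 0, "grief": 1}, n_labels=2)
labels = {w: max(range(2), key=lambda k: s[k]) for w, s in scores.items()}
```

The unlabeled words "delight" and "sorrow" inherit the label of the seed they sit closest to in the embedding space, which is the mechanism that lets a small seed lexicon grow.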
Positive, Negative, or Neutral: Learning an Expanded Opinion Lexicon from Emoticon-Annotated Tweets
Bravo-Marquez, Felipe (The University of Waikato) | Frank, Eibe (The University of Waikato) | Pfahringer, Bernhard (The University of Waikato)
We present a supervised framework for expanding an opinion lexicon for tweets. The lexicon contains part-of-speech (POS) disambiguated entries with a three-dimensional probability distribution for positive, negative, and neutral polarities. To obtain this distribution using machine learning, we propose word-level attributes based on POS tags and information calculated from streams of emoticon-annotated tweets. Our experimental results show that our method outperforms the three-dimensional word-level polarity classification performance obtained by semantic orientation, a state-of-the-art measure for establishing word-level sentiment.
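The semantic orientation baseline mentioned above is commonly computed as the difference between a word's pointwise mutual information (PMI) with the positive class and with the negative class. The sketch below shows one such formulation over emoticon-labeled tweets. It illustrates the baseline only, not the supervised framework proposed in the paper; the add-one smoothing, the whitespace tokenizer, the label strings, and the toy tweets are assumptions.

```python
import math
from collections import Counter


def semantic_orientation(tweets, smoothing=1.0):
    """Score each word as PMI(word, positive) - PMI(word, negative).

    `tweets` is an iterable of (text, label) pairs where label is
    "positive" or "negative" (e.g. derived from emoticons).
    """
    pos_counts, neg_counts = Counter(), Counter()
    for text, label in tweets:
        tokens = text.lower().split()  # naive tokenizer, an assumption
        if label == "positive":
            pos_counts.update(tokens)
        elif label == "negative":
            neg_counts.update(tokens)
    pos_total = sum(pos_counts.values())
    neg_total = sum(neg_counts.values())
    if pos_total == 0 or neg_total == 0:
        raise ValueError("need at least one token in each class")
    scores = {}
    for word in set(pos_counts) | set(neg_counts):
        pos = pos_counts[word] + smoothing  # smoothing avoids log(0)
        neg = neg_counts[word] + smoothing
        # The PMI difference simplifies to this single log ratio.
        scores[word] = math.log2((pos * neg_total) / (neg * pos_total))
    return scores


# Invented toy tweets with labels standing in for emoticon annotations.
tweets = [
    ("great game tonight", "positive"),
    ("great win", "positive"),
    ("awful game tonight", "negative"),
    ("awful loss", "negative"),
]
so = semantic_orientation(tweets)
```

A positive score marks a word that leans positive, a negative score one that leans negative, and a score near zero (as for "game" here) a word with no clear polarity. A single score of this kind cannot separate neutral words from ambiguous ones, which is one motivation for a three-way distribution.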