Text Classification
Zero-shot Text Classification vs. Similarity-based Text Classification
This post is based on our NLPIR 2022 paper "Evaluating Unsupervised Text Classification: Zero-shot and Similarity-based Approaches". You can read more details there. Unsupervised text classification approaches aim to perform categorization without using annotated data during training and therefore offer the potential to reduce annotation costs. Generally, they map text to labels based on the labels' textual descriptions. To accomplish this, there are two main categories of approaches. The first category can be summarized as similarity-based approaches.
Is word segmentation necessary for Vietnamese sentiment classification?
Nguyen, Duc-Vu, Nguyen, Ngan Luu-Thuy
To the best of our knowledge, this paper made the first attempt to answer whether word segmentation is necessary for Vietnamese sentiment classification. To do this, we presented five pre-trained monolingual S4-based language models for Vietnamese, including one model without word segmentation and four models using the RDRsegmenter, uitnlp, pyvi, or underthesea toolkits in the data pre-processing phase. Based on comprehensive experimental results on two corpora, the VLSP2016-SA corpus of technical article reviews from the news and social media and the UIT-VSFC corpus of the educational survey, we have two suggestions. Firstly, when using traditional classifiers like Naive Bayes or Support Vector Machines, word segmentation may not be necessary for a Vietnamese sentiment classification corpus that comes from the social domain. Secondly, word segmentation is necessary for Vietnamese sentiment classification when it is applied before the BPE method and the result is fed into a deep learning model. In this setting, RDRsegmenter is the most stable word segmentation toolkit compared with uitnlp, pyvi, and underthesea.
Distant Reading of the German Coalition Deal: Recognizing Policy Positions with BERT-based Text Classification
Zylla, Michael, Haider, Thomas
In postwar Germany, the federal government is usually formed by several political parties (Schmidt, 2007, p. 97). Over the past 16 years, these government coalitions were led by the Christian Democratic parliamentary group (CDU/CSU), most recently in cooperation with the Social Democratic Party (SPD), which, following the federal election in 2021, was unwilling to negotiate with their former partner, calling for new alliances to achieve a majority in parliament. Finally, the leaders of the Free Democratic Party (FDP), the Greens and SPD, despite mixed support from the party bases, signed a coalition agreement. Some journalists even regarded the FDP, which gained access to two key ministries, as the secret winner of the negotiations (Fürstenau, 2021), also because the Greens did not see some of their desired climate change policies implemented (Lauter, 2021). In this research, we are interested in how the coalition agreement was assembled regarding the individual party contributions. To that end, we utilize methods from Natural Language Processing, which have seen widespread adoption in political science (Wilkerson and Casas, 2017; Merz et al., 2016; Rauh, 2015; Slapin and Proksch, 2008).
Learning to Detect Noisy Labels Using Model-Based Features
Wang, Zhihao, Lin, Zongyu, Liu, Peiqi, Zheng, Guidong, Wen, Junjie, Chen, Xianxin, Chen, Yujun, Yang, Zhilin
Label noise is ubiquitous in various machine learning scenarios such as self-labeling with model predictions and erroneous data annotation. Many existing approaches are based on heuristics such as sample losses, which might not be flexible enough to achieve optimal solutions. Meta learning based methods address this issue by learning a data selection function, but can be hard to optimize. In light of these pros and cons, we propose Selection-Enhanced Noisy label Training (SENT) that does not rely on meta learning while having the flexibility of being data-driven. SENT transfers the noise distribution to a clean set and trains a model to distinguish noisy labels from clean ones using model-based features. Empirically, on a wide range of tasks including text classification and speech recognition, SENT improves performance over strong baselines under the settings of self-training and label corruption.
Text classification in shipping industry using unsupervised models and Transformer based supervised models
Obtaining labelled data in a particular context can be expensive and time consuming. Although different algorithms, including unsupervised learning, semi-supervised learning, and self-learning, have been adopted, the performance of text classification varies with context. Given the lack of a labelled dataset, we proposed a novel and simple unsupervised text classification model to classify cargo content in the international shipping industry using the Standard International Trade Classification (SITC) codes. Our method stems from representing words using pretrained GloVe word embeddings and finding the most likely label using cosine similarity. To compare the unsupervised text classification model with supervised classification, we also applied several Transformer models to classify cargo content. Due to the lack of training data, the SITC numerical codes and the corresponding textual descriptions were used as training data. A small number of manually labelled cargo content records were used to evaluate the classification performance of the unsupervised classification and the Transformer based supervised classification. The comparison reveals that unsupervised classification significantly outperforms Transformer based supervised classification even after increasing the size of the training dataset by 30%. The lack of training data is a key bottleneck that prevents deep learning models (such as Transformers) from being applied successfully in practice. Unsupervised classification can provide an efficient and effective alternative for classifying text when training data is scarce.
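The similarity-based idea described above (embed the text, embed each label description, pick the label with the highest cosine similarity) can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the three-dimensional vectors in TOY_VECTORS are invented stand-ins for pretrained GloVe embeddings, the labels "food" and "metals" are hypothetical rather than real SITC codes, and averaging word vectors is an assumed way of building a text vector.

```python
import math

# Invented 3-dimensional vectors standing in for pretrained GloVe embeddings.
TOY_VECTORS = {
    "fresh": [0.9, 0.1, 0.0],
    "fruit": [0.8, 0.2, 0.1],
    "bananas": [0.85, 0.15, 0.05],
    "steel": [0.1, 0.9, 0.1],
    "pipes": [0.0, 0.8, 0.2],
    "metal": [0.1, 0.85, 0.15],
}


def embed(text):
    """Average the vectors of all known words; return None if none are known."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(text, label_descriptions):
    """Return the label whose description is most similar to the text."""
    text_vec = embed(text)
    if text_vec is None:
        return None
    best_label, best_score = None, -1.0
    for label, description in label_descriptions.items():
        label_vec = embed(description)
        if label_vec is None:
            continue
        score = cosine(text_vec, label_vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical label set: label name -> textual description.
LABELS = {"food": "fresh fruit", "metals": "metal pipes"}
prediction = classify("steel pipes", LABELS)
```

No labelled training examples are needed here; the only supervision is the textual description attached to each label, which is what makes the approach usable when annotated data is scarce.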
Interpreting the Prediction of BERT Model for Text Classification
Integrated gradients is a method to compute the attribution of each feature of a deep learning model based on the gradient of the model's output (prediction) with respect to the input. This method applies to any deep learning model for classification and regression tasks. As an example, let's say that we have a text classification model and we want to interpret its prediction. With integrated gradients, in the end, we will get the attribution score of each input word with respect to the final prediction. We can use this attribution score to find out which words play an important role in our model's final prediction.
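To make the idea concrete, here is a minimal sketch of integrated gradients on a toy model rather than BERT. Everything in it is an assumption chosen for illustration: the model is a small logistic regression over three hypothetical word-indicator features, the gradient is written out analytically, the baseline is the all-zeros input, and the path integral is approximated with a midpoint Riemann sum. For a real Transformer the same procedure would typically be applied to the input embeddings using automatic differentiation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy classifier: probability of the positive class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def gradient(x, w, b):
    """Analytic gradient of the model output with respect to each input."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate integrated gradients along the straight path baseline -> x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = gradient(point, w, b)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by the input difference, feature by feature.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]


# Hypothetical weights for three word features, and an input where all occur.
weights = [2.0, -1.0, 0.0]
bias = 0.0
inputs = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(inputs, baseline, weights, bias)
```

A useful sanity check is the completeness property: the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline. In this toy setup the first feature receives a positive score, the second a negative one, and the third (with zero weight) receives none.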
NLP and Customer Funnel: Using PySpark to Weight Events
The customer funnel, also known as the marketing funnel or sales funnel, is a conceptual model that represents the journey a customer goes through as they move from awareness of a product or service to the point of purchase. The funnel is usually depicted as a wide top that narrows as it progresses downward, with each stage representing a different phase in the customer's journey. Understanding the customer funnel can help businesses understand how to effectively market and sell their products or services and identify areas where they can improve the customer experience. TF-IDF, which stands for "term frequency-inverse document frequency," is a statistical measure that can be used to assign weights to words or phrases in a document. It is commonly used in information retrieval and natural language processing tasks, including text classification, clustering, and search. In the context of the customer funnel, TF-IDF could be used to weigh different events or actions that a customer takes as they move through the funnel.
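As a rough illustration of how TF-IDF could weight funnel events, the sketch below treats each customer journey as a "document" and each event as a "term". It is written in plain Python rather than PySpark so that it runs without a cluster, the event names and journeys are made up, and it uses the basic unsmoothed formula tf * log(N / df); libraries often apply smoothed variants, so the exact numbers would differ there.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append(
            {
                term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


# Hypothetical customer journeys, one list of events per customer.
journeys = [
    ["page_view", "page_view", "add_to_cart"],
    ["page_view", "purchase"],
    ["page_view", "add_to_cart", "purchase"],
]

event_weights = tf_idf(journeys)
```

Events that occur in every journey, such as page_view here, get a weight of zero because they do not distinguish one customer from another, while rarer events such as purchase are weighted more heavily.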
On-the-fly Denoising for Data Augmentation in Natural Language Understanding
Fang, Tianqing, Zhou, Wenxuan, Liu, Fangyu, Zhang, Hongming, Song, Yangqiu, Chen, Muhao
Data Augmentation (DA) is frequently used to automatically provide additional training data without extra human annotation. However, data augmentation may introduce noisy data that impairs training. To guarantee the quality of augmented data, existing methods either assume no noise exists in the augmented data and adopt consistency training or use simple heuristics such as training loss and diversity constraints to filter out "noisy" data. However, those filtered examples may still contain useful information, and dropping them completely causes loss of supervision signals. In this paper, based on the assumption that the original dataset is cleaner than the augmented data, we propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data. A simple self-regularization module is applied to force the model prediction to be consistent across two distinct dropouts to further prevent overfitting on noisy labels. Our method can be applied to augmentation techniques in general and can consistently improve the performance on both text classification and question-answering tasks.
Multi-View Active Learning for Short Text Classification in User-Generated Data
Karisani, Payam, Karisani, Negin, Xiong, Li
Mining user-generated data often suffers from a lack of labeled data, short document lengths, and informal user language. In this paper, we propose a novel active learning model to overcome these obstacles in tasks tailored for query phrases--e.g., detecting positive reports of natural disasters. Our model has three novelties: 1) It is the first approach to employ multi-view active learning in this domain. 2) It uses the Parzen-Rosenblatt window method to integrate the representativeness measure into multi-view active learning. 3) It employs a query-by-committee strategy, based on the agreement between predictors, to address the usually noisy language of the documents in this domain. We evaluate our model on four publicly available Twitter datasets with distinctly different applications. We also compare our model with a wide range of baselines, including those with multiple classifiers. The experiments show that our model is highly consistent and outperforms existing models.
Less is More: Parameter-Free Text Classification with Gzip
Jiang, Zhiying, Yang, Matthew Y. R., Tsirlin, Mikhail, Tang, Raphael, Lin, Jimmy
Deep neural networks (DNNs) are often used for text classification tasks as they usually achieve high levels of accuracy. However, DNNs can be computationally intensive, requiring billions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that's easy, light-weight and universal in text classification: a combination of a simple compressor like gzip with a k-nearest-neighbor classifier. Without any training, pre-training or fine-tuning, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also performs particularly well in few-shot settings where labeled data are too scarce for DNNs to achieve a satisfying accuracy.
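The combination of a compressor with a nearest-neighbor classifier can be sketched as follows. This is an illustrative approximation rather than the authors' code: it assumes the normalized compression distance (NCD) as the distance measure, joins the two texts with a single space before compressing, and uses a tiny made-up training set with hypothetical "finance" and "sports" labels.

```python
import gzip
from collections import Counter


def compressed_length(text):
    """Length in bytes of the gzip-compressed UTF-8 encoding of the text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a, b):
    """Normalized compression distance between two texts."""
    ca, cb = compressed_length(a), compressed_length(b)
    cab = compressed_length(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def knn_predict(query, train, k=1):
    """Majority label among the k training texts closest to the query."""
    ranked = sorted(train, key=lambda item: ncd(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up training data: (text, label) pairs.
train = [
    ("the stock market fell sharply as investors sold shares", "finance"),
    ("the central bank raised interest rates to curb inflation", "finance"),
    ("the team scored a late goal to win the football match", "sports"),
    ("the striker was injured during the second half of the game", "sports"),
]

predicted = knn_predict("investors bought shares as the market recovered", train, k=3)
```

The intuition is that two texts sharing vocabulary and phrasing compress better together than two unrelated texts, so their distance is smaller. There are no trainable parameters; the only choices are the compressor, the distance, and k. On inputs this short the fixed overhead of the gzip format dominates the compressed size, so predictions on such a toy set should not be read as evidence of accuracy.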