Text Classification
ZeroBERTo -- Leveraging Zero-Shot Text Classification by Topic Modeling
Alcoforado, Alexandre, Ferraz, Thomas Palmeira, Gerber, Rodrigo, Bustos, Enzo, Oliveira, André Seidel, Veloso, Bruno Miguel, Siqueira, Fabio Levy, Costa, Anna Helena Reali
Traditional text classification approaches often require a good amount of labeled data, which is difficult to obtain, especially in restricted domains or less widespread languages. This lack of labeled data has led to the rise of low-resource methods, which assume low data availability in natural language processing. Among them, zero-shot learning stands out; it consists of learning a classifier without any previously labeled data. The best results reported with this approach use language models such as Transformers, but they run into two problems: high execution time and an inability to handle long texts as input. This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task. We show that ZeroBERTo has better performance for long inputs and shorter execution time, outperforming XLM-R by about 12% in F1 score on the FolhaUOL dataset.
TextRGNN: Residual Graph Neural Networks for Text Classification
Chen, Jiayuan, Zhang, Boyu, Xu, Yinfei, Wang, Meng
Recently, text classification models based on graph neural networks (GNNs) have attracted more and more attention. Most of these models adopt a similar network paradigm: node embeddings initialized from pre-training, followed by a two-layer graph convolution. In this work, we propose TextRGNN, an improved GNN structure that introduces residual connections to deepen the convolutional network. Our structure can obtain a wider node receptive field and effectively suppress the over-smoothing of node features. In addition, we integrate a probabilistic language model into the initialization of the graph node embeddings, so that non-graph semantic information can be better extracted. The experimental results show that our model is general and efficient. It significantly improves classification accuracy at both the corpus level and the text level, and achieves SOTA performance on a wide range of text classification datasets.
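The residual idea in the abstract above can be illustrated with a small sketch. This is not the TextRGNN architecture itself: it assumes the standard symmetric normalization of the adjacency matrix with self-loops and a plain identity skip connection, and the names `normalize_adjacency`, `residual_gcn_layer`, and `residual_gcn` are hypothetical.

```python
import numpy as np


def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Symmetrically normalize A + I, i.e. D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # node degrees (always >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def residual_gcn_layer(a_norm: np.ndarray, h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One graph convolution with an identity skip connection (assumed form)."""
    return np.maximum(a_norm @ h @ w, 0.0) + h   # ReLU(A_norm H W) + H


def residual_gcn(adj: np.ndarray, features: np.ndarray, weights: list) -> np.ndarray:
    """Stack several residual layers; each weight must be square (d x d)."""
    a_norm = normalize_adjacency(adj)
    h = features
    for w in weights:
        h = residual_gcn_layer(a_norm, h, w)
    return h


# Hypothetical toy graph: three nodes on a path, two features per node.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
features = np.arange(6, dtype=float).reshape(3, 2)
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(2, 2)) for _ in range(4)]
deep_output = residual_gcn(adj, features, weights)
```

Because each layer adds its input back to its output, the original node features are still present after many layers, which is the intuition behind using skip connections against over-smoothing.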
Does QA-based intermediate training help fine-tuning language models for text classification?
Fine-tuning pre-trained language models for downstream tasks has become the norm in NLP. Recently, it has been found that intermediate training on high-level inference tasks such as Question Answering (QA) can improve the performance of some language models on target tasks. However, it is not clear whether intermediate training benefits language models in general. In this paper, using the SQuAD-2.0 QA task as intermediate training for target text classification tasks, we experimented on eight tasks for single-sequence classification and eight tasks for sequence-pair classification using two base and two compact language models. Our experiments show that QA-based intermediate training generates varying transfer performance across different language models, except for similar QA tasks.
RheFrameDetect: A Text Classification System for Automatic Detection of Rhetorical Frames in AI from Open Sources
Ghosh, Saurav, Loustaunau, Philippe
Rhetorical Frames in AI can be thought of as expressions that describe AI development as a competition between two or more actors, such as governments or companies. Examples of such Frames include robotic arms race, AI rivalry, technological supremacy, cyberwarfare dominance and 5G race. Detection of Rhetorical Frames from open sources can help us track the attitudes of governments or companies towards AI, specifically whether attitudes are becoming more cooperative or competitive over time. Given the rapidly increasing volumes of open sources (online news media, Twitter, blogs), it is difficult for subject matter experts to identify Rhetorical Frames in (near) real-time. Moreover, these sources are in general unstructured (noisy), so detecting Frames from them requires state-of-the-art text classification techniques. In this paper, we develop RheFrameDetect, a text classification system for (near) real-time capture of Rhetorical Frames from open sources. Given an input document, RheFrameDetect employs text classification techniques at multiple levels (document level and paragraph level) to identify all occurrences of Frames used in the discussion of AI. We performed an extensive evaluation of the text classification techniques used in RheFrameDetect against human-annotated Frames from multiple news sources. To further demonstrate the effectiveness of RheFrameDetect, we show multiple case studies depicting the Frames identified by RheFrameDetect compared against human-annotated Frames.
Unsupervised Text Classification with Lbl2Vec
Text classification is the task of assigning an appropriate category to a sentence or document. The categories depend on the selected dataset and can cover arbitrary subjects. Therefore, text classifiers can be used to organize, structure, and categorize any kind of text. Common approaches use supervised learning to classify texts. BERT-based language models in particular have achieved very good text classification results in recent years.
A Survey on Text Classification: From Shallow to Deep Learning
Text classification is the most fundamental and essential task in natural language processing. The last decade has seen a surge of research in this area due to the unprecedented success of deep learning. Numerous methods, datasets, and evaluation metrics have been proposed in the literature, raising the need for a comprehensive and updated survey. This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021, covering models from traditional approaches to deep learning. We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification. We then discuss each of these categories in detail, dealing with both the technical developments and benchmark datasets that support tests of predictions. This survey also provides a comprehensive comparison of different techniques and identifies the pros and cons of various evaluation metrics. Finally, we conclude by summarizing key implications, future research directions, and the challenges facing the research area.
Open Vocabulary Electroencephalography-To-Text Decoding and Zero-shot Sentiment Classification
State-of-the-art brain-to-text systems have achieved great success in decoding language directly from brain signals using neural networks. However, current approaches are limited to small closed vocabularies, which are far from enough for natural communication. In addition, most of the high-performing approaches require data from invasive devices (e.g., ECoG). In this paper, we extend the problem to open vocabulary Electroencephalography (EEG)-To-Text Sequence-To-Sequence decoding and zero-shot sentence sentiment classification on natural reading tasks. We hypothesize that the human brain functions as a special text encoder and propose a novel framework leveraging pre-trained language models (e.g., BART). Our model achieves a 40.1% BLEU-1 score on EEG-To-Text decoding and a 55.6% F1 score on zero-shot EEG-based ternary sentiment classification, which significantly outperforms supervised baselines. Furthermore, we show that our proposed model can handle data from various subjects and sources, showing great potential for a high-performance open vocabulary brain-to-text system once sufficient data is available.
Unifying Model Explainability and Robustness for Joint Text Classification and Rationale Extraction
Li, Dongfang, Hu, Baotian, Chen, Qingcai, Xu, Tujie, Tao, Jingcong, Zhang, Yunan
Recent works have shown that explainability and robustness are two crucial ingredients of trustworthy and reliable text classification. However, previous works usually address only one of the two aspects: i) how to extract accurate rationales for explainability while being beneficial to prediction; ii) how to make the predictive model robust to different types of adversarial attacks. Intuitively, a model that produces helpful explanations should be more robust against adversarial attacks, because we cannot trust a model that outputs explanations but changes its prediction under small perturbations. To this end, we propose a joint classification and rationale extraction model named AT-BMC. It includes two key mechanisms: mixed Adversarial Training (AT), which uses various perturbations in discrete and embedding space to improve the model's robustness, and Boundary Match Constraint (BMC), which helps to locate rationales more precisely with the guidance of boundary information. Results on benchmark datasets demonstrate that the proposed AT-BMC outperforms baselines on both classification and rationale extraction by a large margin. Robustness analysis shows that the proposed AT-BMC decreases the attack success rate by up to 69%. The empirical results indicate that there are connections between robust models and better explanations.
Adversarial Examples for Extreme Multilabel Text Classification
Qaraei, Mohammadreza, Babbar, Rohit
Extreme Multilabel Text Classification (XMTC) is a text classification problem in which (i) the output space is extremely large, (ii) each data point may have multiple positive labels, and (iii) the data follows a strongly imbalanced distribution. With applications in recommendation systems and automatic tagging of web-scale documents, research on XMTC has focused on improving prediction accuracy and dealing with imbalanced data. However, the robustness of deep learning based XMTC models against adversarial examples has been largely underexplored. In this paper, we investigate the behaviour of XMTC models under adversarial attacks. To this end, we first define adversarial attacks in multilabel text classification problems. We categorize attacks on multilabel text classifiers as (a) positive-targeted, where the target positive label should fall out of the top-k predicted labels, and (b) negative-targeted, where the target negative label should be among the top-k predicted labels. Then, through experiments on APLC-XLNet and AttentionXML, we show that XMTC models are highly vulnerable to positive-targeted attacks but more robust to negative-targeted ones. Furthermore, our experiments show that the success rate of positive-targeted adversarial attacks has an imbalanced distribution. More precisely, tail classes are highly vulnerable to adversarial attacks, for which an attacker can generate adversarial samples with high similarity to the actual data points. To overcome this problem, we explore the effect of rebalanced loss functions in XMTC, which not only increase accuracy on tail classes but also improve the robustness of these classes against adversarial attacks. The code for our experiments is available at https://github.com/xmc-aalto/adv-xmtc
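The two attack categories defined above reduce to simple checks on the top-k predicted labels. The sketch below only illustrates those success criteria; it does not generate adversarial examples, and the function names and score values are hypothetical rather than taken from the paper's code.

```python
from typing import Sequence, Set


def top_k_labels(scores: Sequence[float], k: int) -> Set[int]:
    """Return the indices of the k highest-scoring labels."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def positive_targeted_success(adv_scores: Sequence[float], target: int, k: int) -> bool:
    """Succeeds when the targeted positive label falls out of the top-k."""
    return target not in top_k_labels(adv_scores, k)


def negative_targeted_success(adv_scores: Sequence[float], target: int, k: int) -> bool:
    """Succeeds when the targeted negative label enters the top-k."""
    return target in top_k_labels(adv_scores, k)


# Hypothetical label scores for one document, before and after a perturbation.
clean_scores = [0.9, 0.8, 0.1, 0.05]
adv_scores = [0.2, 0.8, 0.7, 0.05]

# Label 0 was a true (positive) label; label 2 was a negative label.
pushed_out = positive_targeted_success(adv_scores, target=0, k=2)
pulled_in = negative_targeted_success(adv_scores, target=2, k=2)
```

In this toy case both checks return `True`: the perturbation drops label 0 from the top two and promotes label 2 into them.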
An Introduction to Text Classification in Python for Beginners
I've realized that while students are generally able to copy and paste online code and somehow make their code work, many students who are new-ish to text classification might still not understand what every line of code does. This article will hence attempt to make every line as clear as possible. Using this dataset, we aim to build a machine learning model that can predict if a given review has a negative or positive sentiment. For instance, if we feed our model a review "this is quite bad and disappointing", it should predict the review's sentiment as a 0. If we feed it a review "quite happy with my purchase", our model should predict the review's sentiment as a 1. Run this command in your command prompt or terminal, and the libraries pandas and scikit-learn will be installed on your computer. Here, we need to use the import keyword to tell Python that we want to use these libraries.
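A minimal sketch of the workflow described above might look like the following. The article's dataset, install command, and choice of model are not shown in this excerpt, so everything here is an assumption: the install command in the comment is the usual pip invocation, the toy reviews are invented for illustration, and a bag-of-words Naive Bayes classifier stands in for whatever model the article builds.

```python
# Assumed install step (the article's exact command is not shown here):
#   pip install pandas scikit-learn
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the article's review dataset.
# sentiment: 0 = negative, 1 = positive
data = pd.DataFrame({
    "review": [
        "really bad product, very disappointing",
        "terrible quality and bad service",
        "disappointing, it broke after a day",
        "awful, would not recommend",
        "very happy with this purchase",
        "great product, happy customer",
        "excellent quality, love my purchase",
        "works great, would buy again",
    ],
    "sentiment": [0, 0, 0, 0, 1, 1, 1, 1],
})

# CountVectorizer turns each review into word counts;
# MultinomialNB learns which words are typical of each sentiment.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(data["review"], data["sentiment"])

# The two example reviews quoted in the text above.
predictions = model.predict([
    "this is quite bad and disappointing",
    "quite happy with my purchase",
])
print(predictions)
```

On this toy data the first review is predicted as 0 and the second as 1, matching the behaviour the article describes. A real dataset would also need a train/test split to measure accuracy.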