Text Classification
Extended Multilingual Protest News Detection -- Shared Task 1, CASE 2021 and 2022
Hürriyetoğlu, Ali, Mutlu, Osman, Duruşan, Fırat, Uca, Onur, Gürel, Alaeddin Selçuk, Radford, Benjamin, Dai, Yaoyao, Hettiarachchi, Hansi, Stoehr, Niklas, Nomoto, Tadashi, Slavcheva, Milena, Vargas, Francielle, Javid, Aaqib, Beyhan, Fatih, Yörük, Erdem
We report results of the CASE 2022 Shared Task 1 on Multilingual Protest Event Detection. This task is a continuation of CASE 2021 and consists of four subtasks: i) document classification, ii) sentence classification, iii) event sentence coreference identification, and iv) event extraction. The CASE 2022 extension expands the test data with more data in previously available languages, namely English, Hindi, Portuguese, and Spanish, and adds new test data in Mandarin, Turkish, and Urdu for Subtask 1, document classification. The training data from CASE 2021 in English, Portuguese, and Spanish were utilized. Therefore, predicting document labels in Hindi, Mandarin, Turkish, and Urdu occurs in a zero-shot setting. The CASE 2022 workshop accepts reports on systems developed for predicting test data of CASE 2021 as well. We observe that the best systems submitted by CASE 2022 participants achieve between 79.71 and 84.06 F1-macro for new languages in a zero-shot setting. The winning approaches are mainly ensembling models and merging data in multiple languages. The best two submissions on CASE 2021 data outperform submissions from last year for Subtask 1 and Subtask 2 in all languages. Only the following scenarios were not outperformed by new submissions on CASE 2021: Subtask 3 Portuguese and Subtask 4 English.
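The scores above are reported as F1-macro, which is the unweighted mean of per-class F1. The following is a minimal sketch of that metric for a binary document classification setting; the toy labels are hypothetical and are not taken from the shared task data.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy labels: 1 = protest news, 0 = other.
gold = [1, 1, 0, 0, 0, 0]
pred = [1, 0, 0, 0, 0, 1]
score = macro_f1(gold, pred)  # class 1 F1 = 0.5, class 0 F1 = 0.75
print(round(score, 3))        # 0.625
```

Because every class contributes equally, a system cannot reach a high macro score by predicting only the majority class.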
GUDN: A novel guide network with label reinforcement strategy for extreme multi-label text classification
Wang, Qing, Zhu, Jia, Shu, Hongji, Asamoah, Kwame Omono, Shi, Jianyang, Zhou, Cong
In natural language processing, extreme multi-label text classification (XMTC) is an emerging but essential task. The XMTC problem is to recall some of the most relevant labels for a text from an extremely large label set. Large-scale pre-trained models have brought a new trend to this problem. Although these models have made significant achievements on this problem, effective fine-tuning methods have yet to be studied. Likewise, although label semantics have been introduced in XMTC, the vast semantic gap between texts and labels has yet to gain enough attention. This paper builds a new guide network (GUDN) to help fine-tune the pre-trained model and guide the subsequent classification. Furthermore, GUDN uses raw label semantics combined with a helpful label reinforcement strategy to effectively explore the latent space between texts and labels, narrowing the semantic gap, which can further improve prediction accuracy. Experimental results demonstrate that GUDN outperforms state-of-the-art methods on Eurlex-4k and has competitive results on other popular datasets. In an additional experiment, we investigated the influence of input length on the accuracy of the Transformer-based model. Our source code is released at https://t.hk.uy/aFSH.
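Since XMTC is framed as recalling the few most relevant labels from a very large set, results are commonly reported as precision at k (P@k). The abstract does not name its metric, so the sketch below is an assumption about the evaluation, with hypothetical label names and scores.

```python
def precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scoring labels that are truly relevant."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for label in top_k if label in relevant) / k

# Hypothetical model scores for one document over a (tiny) label set.
scores = {"tax": 0.9, "trade": 0.7, "fishing": 0.4, "energy": 0.1}
relevant = {"tax", "fishing"}

p_at_1 = precision_at_k(scores, relevant, 1)  # top-1 is "tax" -> 1.0
p_at_3 = precision_at_k(scores, relevant, 3)  # 2 of top-3 relevant -> 2/3
```

In a real XMTC setting the label set has thousands of entries, and the score is averaged over all test documents.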
TCBERT: A Technical Report for Chinese Topic Classification BERT
Han, Ting, Pan, Kunhao, Chen, Xinyu, Song, Dingjie, Fan, Yuchen, Gao, Xinyu, Gan, Ruyi, Zhang, Jiaxing
Bidirectional Encoder Representations from Transformers, or BERT (Devlin et al., 2019), has been one of the base models for various NLP tasks due to its remarkable performance. Variants customized for different languages and tasks have been proposed to further improve the performance. In this work, we investigate supervised continued pre-training (Gururangan et al., 2020) on BERT for the Chinese topic classification task. Specifically, we incorporate prompt-based learning and contrastive learning into the pre-training. To adapt to the task of Chinese topic classification, we collect around 2.1M Chinese data spanning various topics. The pre-trained Chinese Topic Classification BERTs (TCBERTs) with different parameter sizes are open-sourced at https://huggingface.co/IDEA-CCNL.
Building for Tomorrow: Assessing the Temporal Persistence of Text Classifiers
Alkhalifa, Rabab, Kochkina, Elena, Zubiaga, Arkaitz
A supervised text classification model relies on labelled datasets for training (Sebastiani, 2002). From an experimental perspective, the design and evaluation of classification models typically rely on data pertaining to fixed periods of time. Recent research demonstrates that such models, while showing competitive performance in their experimental environment, underperform when they need to classify new data that is distant in time from that observed during training (Alkhalifa and Zubiaga, 2022). This deterioration of performance has been demonstrated for different classification tasks, including topic classification (Rocha, Mourão, Pereira, Gonçalves, and Meira, 2008), sentiment classification (Lukes and Søgaard, 2018), hate speech detection (Florio, Basile, Polignano, Basile, and Patti, 2020), stance detection (Alkhalifa, Kochkina, and Zubiaga, 2021) and political ideology detection (Röttger and Pierrehumbert, 2021). This performance drop can happen for multiple reasons, including, among others, the evolution in language use (Smith, 2004) or the evolution of public opinion (Bonilla and Mo, 2019), and its extent may vary (Alkhalifa et al., 2021). This poses an important challenge and limitation for such models when one plans to continue using the model over a long period of time to classify new, incoming data, as can be the case with a stream of user-generated content (Cheng, Chen, Lee, and Li, 2021).
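One way to measure the temporal drop described above is to train on data up to a cutoff year and evaluate separately on each later year. The sketch below shows only the splitting step; the function name, the (year, text, label) tuple layout, and the toy examples are assumptions made for illustration.

```python
from collections import defaultdict

def temporal_split(examples, train_until):
    """Split (year, text, label) examples into a training set
    (years <= train_until) and one test set per later year."""
    train, test_by_year = [], defaultdict(list)
    for year, text, label in examples:
        if year <= train_until:
            train.append((text, label))
        else:
            test_by_year[year].append((text, label))
    return train, dict(sorted(test_by_year.items()))

# Hypothetical stance-labelled posts spanning four years.
examples = [
    (2018, "post a", "favor"), (2018, "post b", "against"),
    (2019, "post c", "favor"), (2020, "post d", "against"),
    (2021, "post e", "favor"), (2021, "post f", "favor"),
]
train, test_by_year = temporal_split(examples, train_until=2019)
# A classifier fitted on `train` would then be scored on each entry of
# `test_by_year` to see how performance changes with temporal distance.
```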
Pairwise Instance Relation Augmentation for Long-tailed Multi-label Text Classification
Xiao, Lin, Xu, Pengyu, Jing, Liping, Zhang, Xiangliang
Multi-label text classification (MLTC) is one of the key tasks in natural language processing. It aims to assign multiple target labels to one document. Due to the uneven popularity of labels, the number of documents per label follows a long-tailed distribution in most cases. It is much more challenging to learn classifiers for data-scarce tail labels than for data-rich head labels. The main reason is that head labels usually have sufficient information, e.g., a large intra-class diversity, while tail labels do not. In response, we propose a Pairwise Instance Relation Augmentation Network (PIRAN) to augment tail-label documents and thereby balance tail labels and head labels. PIRAN consists of a relation collector and an instance generator. The former aims to extract pairwise document relations from head labels. Taking these relations as perturbations, the latter tries to generate new document instances in a high-level feature space around the limited given tail-label instances. Meanwhile, two regularizers (diversity and consistency) are designed to constrain the generation process. The consistency regularizer encourages the variance of tail labels to be close to that of head labels and further balances the whole dataset. The diversity regularizer ensures that the generated instances are diverse and avoids generating redundant instances. Extensive experimental results on three benchmark datasets demonstrate that PIRAN consistently outperforms state-of-the-art (SOTA) methods and dramatically improves the performance on tail labels.
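The idea of borrowing pairwise relations from head labels as perturbations for tail-label instances can be illustrated with a deliberately simplified sketch. It assumes a relation is just the difference between two head-label feature vectors and that generation is plain addition; PIRAN's learned relation collector, instance generator, and regularizers are not reproduced here, and all names and numbers are hypothetical.

```python
import itertools
import random

def collect_relations(head_features):
    """Pairwise differences between feature vectors of documents that
    share a head label (assumed stand-in for a learned relation collector)."""
    return [[a_i - b_i for a_i, b_i in zip(a, b)]
            for a, b in itertools.permutations(head_features, 2)]

def augment_tail(tail_features, relations, per_instance, scale=0.5, seed=0):
    """Create new feature vectors around each tail instance by adding a
    scaled, randomly chosen head-label relation as a perturbation."""
    rng = random.Random(seed)
    generated = []
    for x in tail_features:
        for _ in range(per_instance):
            r = rng.choice(relations)
            generated.append([x_i + scale * r_i for x_i, r_i in zip(x, r)])
    return generated

# Hypothetical 2-d document features.
head = [[1.0, 0.0], [0.8, 0.3], [1.2, -0.1]]   # data-rich head label
tail = [[-1.0, 2.0], [-0.9, 2.2]]              # data-scarce tail label
relations = collect_relations(head)            # 3 * 2 = 6 ordered pairs
new_tail = augment_tail(tail, relations, per_instance=3)
```

The generated vectors stay near the original tail instances while inheriting the spread observed among head-label documents, which is the intuition behind the augmentation.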
AdaPrompt: Adaptive Model Training for Prompt-based NLP
Chen, Yulong, Liu, Yang, Dong, Li, Wang, Shuohang, Zhu, Chenguang, Zeng, Michael, Zhang, Yue
Prompt-based learning, with its capability to tackle zero-shot and few-shot NLP tasks, has gained much attention in the community. The main idea is to bridge the gap between NLP downstream tasks and language modeling (LM) by mapping these tasks into natural language prompts, which are then filled by pre-trained language models (PLMs). However, for prompt learning, there are still two salient gaps between NLP tasks and pretraining. First, prompt information is not necessarily sufficiently present during LM pretraining. Second, task-specific data are not necessarily well represented during pretraining. We address these two issues by proposing AdaPrompt, which adaptively retrieves external data for continual pretraining of PLMs by making use of both task and prompt characteristics. In addition, we make use of knowledge in Natural Language Inference models for deriving adaptive verbalizers. Experimental results on five NLP benchmarks show that AdaPrompt can improve over standard PLMs in few-shot settings. In addition, in zero-shot settings, our method outperforms standard prompt-based methods by up to 26.35% relative error reduction.
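The mapping from a classification task to a prompt that a PLM fills in, followed by a verbalizer that maps filled-in words back to labels, can be sketched as follows. The template, the verbalizer words, and the mask-token probabilities are all hypothetical; in practice the probabilities would come from a masked language model, and AdaPrompt's retrieval and adaptive verbalizer steps are not shown.

```python
def build_prompt(text, template="{text} In summary, it was [MASK]."):
    """Wrap the input in a cloze-style template (template is an assumption)."""
    return template.format(text=text)

def classify(mask_probs, verbalizer):
    """Score each label by the mean probability of its verbalizer words
    at the [MASK] position and return the best label with all scores."""
    scores = {
        label: sum(mask_probs.get(w, 0.0) for w in words) / len(words)
        for label, words in verbalizer.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical verbalizer and hypothetical PLM output for the [MASK] slot.
verbalizer = {"positive": ["great", "good"], "negative": ["terrible", "bad"]}
mask_probs = {"great": 0.30, "good": 0.20, "bad": 0.05, "terrible": 0.01}

prompt = build_prompt("The plot was gripping from start to finish.")
label, scores = classify(mask_probs, verbalizer)  # label == "positive"
```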
GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation
Guo, Biyang, Gong, Yeyun, Shen, Yelong, Han, Songqiao, Huang, Hailiang, Duan, Nan, Chen, Weizhu
We introduce GENIUS: a conditional text generation model using sketches as input, which can fill in the missing contexts for a given sketch (key information consisting of textual spans, phrases, or words, concatenated by mask tokens). GENIUS is pre-trained on a large-scale textual corpus with a novel reconstruction from sketch objective using an extreme and selective masking strategy, enabling it to generate diverse and high-quality texts given sketches. Comparison with other competitive conditional language models (CLMs) reveals the superiority of GENIUS's text generation quality. We further show that GENIUS can be used as a strong and ready-to-use data augmentation tool for various natural language processing (NLP) tasks. Most existing textual data augmentation methods are either too conservative, by making small changes to the original text, or too aggressive, by creating entirely new samples. With GENIUS, we propose GeniusAug, which first extracts the target-aware sketches from the original training set and then generates new samples based on the sketches. Empirical experiments on 6 text classification datasets show that GeniusAug significantly improves the models' performance in both in-distribution (ID) and out-of-distribution (OOD) settings. We also demonstrate the effectiveness of GeniusAug on named entity recognition (NER) and machine reading comprehension (MRC) tasks. (Code and models are publicly available at https://github.com/microsoft/SCGLab and https://github.com/beyondguo/genius)
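A sketch, as described above, keeps key words or spans and joins them with mask tokens. The following minimal example builds such a sketch from a sentence, assuming the key words are supplied by hand and that the mask token is written `<mask>`; the extreme and selective masking strategy GENIUS uses to choose what to keep is not reproduced.

```python
def make_sketch(text, keep, mask="<mask>"):
    """Keep tokens listed in `keep`; collapse every run of other tokens
    into a single mask token."""
    out = []
    for tok in text.split():
        word = tok.strip(".,;:!?")
        if word.lower() in keep:
            out.append(word)
        elif not out or out[-1] != mask:
            out.append(mask)
    return " ".join(out)

# Hypothetical sentence and hand-picked key words.
text = "The city council approved a new budget for public transport on Monday."
keep = {"council", "budget", "public", "transport"}
sketch = make_sketch(text, keep)
print(sketch)  # <mask> council <mask> budget <mask> public transport <mask>
```

A generator conditioned on this sketch would then fill each masked position with new context, which is what makes the approach usable for data augmentation.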
An Efficient Active Learning Pipeline for Legal Text Classification
Mamooler, Sepideh, Lebret, Rémi, Massonnet, Stéphane, Aberer, Karl
Active Learning (AL) is a powerful tool for learning with less labeled data, in particular for specialized domains, like legal documents, where unlabeled data is abundant but the annotation requires domain expertise and is thus expensive. Recent works have shown the effectiveness of AL strategies for pre-trained language models. However, most AL strategies require a set of labeled samples to start with, which is expensive to acquire. In addition, pre-trained language models have been shown to be unstable during fine-tuning with small datasets, and their embeddings are not semantically meaningful. In this work, we propose a pipeline for effectively using active learning with pre-trained language models in the legal domain. To this end, we leverage the available unlabeled data in three phases. First, we continue pre-training the model to adapt it to the downstream task. Second, we use knowledge distillation to guide the model's embeddings to a semantically meaningful space. Finally, we propose a simple, yet effective, strategy to find the initial set of labeled samples with fewer actions compared to existing methods. Our experiments on Contract-NLI, adapted to the classification task, and LEDGAR benchmarks show that our approach outperforms standard AL strategies and is more efficient. Furthermore, our pipeline reaches comparable results to the fully-supervised approach with a small performance gap and dramatically reduced annotation cost. Code and the adapted data will be made available.
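For context, a standard AL query step of the kind the abstract compares against can be sketched as entropy-based uncertainty sampling: the model's least confident pool items are sent to the annotator first. This is a generic baseline, not the initial-set strategy the authors propose, and the clause identifiers and probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(pool_probs, budget):
    """Return the ids of the `budget` unlabeled items whose predictions
    are most uncertain (highest entropy)."""
    ranked = sorted(pool_probs, key=lambda doc_id: entropy(pool_probs[doc_id]),
                    reverse=True)
    return ranked[:budget]

# Hypothetical predicted class probabilities for four unlabeled clauses.
pool_probs = {
    "clause_a": [0.98, 0.01, 0.01],
    "clause_b": [0.40, 0.35, 0.25],
    "clause_c": [0.70, 0.20, 0.10],
    "clause_d": [0.34, 0.33, 0.33],
}
queried = select_for_annotation(pool_probs, budget=2)
print(queried)  # ['clause_d', 'clause_b']
```

Note that this step presupposes a model trained on some initial labeled set, which is exactly the cold-start cost the pipeline above aims to reduce.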
Disentangling Task Relations for Few-shot Text Classification via Self-Supervised Hierarchical Task Clustering
Zha, Juan, Li, Zheng, Wei, Ying, Zhang, Yu
Few-Shot Text Classification (FSTC) imitates humans to learn a new text classifier efficiently with only a few examples, by leveraging prior knowledge from historical tasks. However, most prior works assume that all the tasks are sampled from a single data source, which cannot adapt to real-world scenarios where tasks are heterogeneous and lie in different distributions. As such, existing methods may suffer when handling task heterogeneity because of their globally shared knowledge mechanisms. On the other hand, inherent task relations are not explicitly captured, making task knowledge unorganized and hard to transfer to new tasks. Thus, we explore a new FSTC setting where tasks can come from a diverse range of data sources. To address the task heterogeneity, we propose a self-supervised hierarchical task clustering (SS-HTC) method. SS-HTC not only customizes cluster-specific knowledge by dynamically organizing heterogeneous tasks into different clusters in hierarchical levels but also disentangles underlying relations between tasks to improve interpretability. Extensive experiments on five public FSTC benchmark datasets demonstrate the effectiveness of SS-HTC.
CCPrompt: Counterfactual Contrastive Prompt-Tuning for Many-Class Classification
Li, Yang, Xu, Canran, Shen, Tao, Jiang, Jing, Long, Guodong
With the success of the prompt-tuning paradigm in Natural Language Processing (NLP), various prompt templates have been proposed to further stimulate specific knowledge for serving downstream tasks, e.g., machine translation, text generation, relation extraction, and so on. Existing prompt templates are mainly shared among all training samples with the information of task description. However, training samples are quite diverse. A shared task description is unable to stimulate the unique task-related information in each training sample, especially for tasks with a finite label space. To exploit the unique task-related information, we imitate the human decision process, which aims to find the contrastive attributes between the objective fact and its potential counterfactuals. Thus, we propose the Counterfactual Contrastive Prompt-Tuning (CCPrompt) approach for many-class classification, e.g., relation classification, topic classification, and entity typing. Compared with simple classification tasks, these tasks have more complex finite label spaces and place more rigorous demands on prompts. First of all, we prune the finite label space to construct fact-counterfactual pairs. Then, we exploit the contrastive attributes by projecting training instances onto every fact-counterfactual pair. We further set up global prototypes corresponding to all contrastive attributes for selecting valid contrastive attributes as additional tokens in the prompt template. Finally, simple Siamese representation learning is employed to enhance the robustness of the model. We conduct experiments on relation classification, topic classification, and entity typing tasks in both fully supervised and few-shot settings. The results indicate that our model outperforms former baselines.
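The first step above, pruning the label space to build fact-counterfactual pairs, can be illustrated with a small sketch. The abstract does not state the pruning criterion, so the version below assumes the counterfactuals are the k highest-scoring labels other than the gold label; the function name, labels, and scores are hypothetical.

```python
def fact_counterfactual_pairs(label_scores, fact, k=2):
    """Pair the gold label (fact) with the k most plausible other labels
    (counterfactuals), ranked by an assumed per-label score."""
    candidates = {l: s for l, s in label_scores.items() if l != fact}
    pruned = sorted(candidates, key=candidates.get, reverse=True)[:k]
    return [(fact, cf) for cf in pruned]

# Hypothetical scores for one topic classification instance.
label_scores = {"sports": 0.50, "business": 0.30, "science": 0.15, "world": 0.05}
pairs = fact_counterfactual_pairs(label_scores, fact="sports", k=2)
print(pairs)  # [('sports', 'business'), ('sports', 'science')]
```

Restricting attention to a few hard counterfactuals keeps the number of pairs manageable when the label space is large, which matters for many-class tasks.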