Text Classification
Keras documentation: Text classification from scratch
Authors: Mark Omernick, Francois Chollet Date created: 2019/11/06 Last modified: 2020/05/17 Description: Text sentiment classification starting from raw text files. This example shows how to do text classification starting from raw text (as a set of text files on disk). We demonstrate the workflow on the IMDB sentiment classification dataset (unprocessed version). We use the TextVectorization layer for word splitting & indexing. Let's download the data and inspect its structure.
Demographics Should Not Be the Reason of Toxicity: Mitigating Discrimination in Text Classifications with Instance Weighting
Zhang, Guanhua, Bai, Bing, Zhang, Junqi, Bai, Kun, Zhu, Conghui, Zhao, Tiejun
With the recent proliferation of the use of text classifications, researchers have found that there are certain unintended biases in text classification datasets. For example, texts containing some demographic identity-terms (e.g., "gay", "black") are more likely to be abusive in existing abusive language detection datasets. As a result, models trained with these datasets may consider sentences like "She makes me happy to be gay" as abusive simply because of the word "gay." In this paper, we formalize the unintended biases in text classification datasets as a kind of selection bias from the non-discrimination distribution to the discrimination distribution. Based on this formalization, we further propose a model-agnostic debiasing training framework by recovering the non-discrimination distribution using instance weighting, which does not require any extra resources or annotations apart from a pre-defined set of demographic identity-terms. Experiments demonstrate that our method can effectively alleviate the impacts of the unintended biases without significantly hurting models' generalization ability.
HeteGCN: Heterogeneous Graph Convolutional Networks for Text Classification
Ragesh, Rahul, Sellamanickam, Sundararajan, Iyer, Arun, Bairi, Ram, Lingam, Vijay
We consider the problem of learning efficient and inductive graph convolutional networks for text classification with a large number of examples and features. Existing state-of-the-art graph embedding based methods such as predictive text embedding (PTE) and TextGCN have shortcomings in terms of predictive performance, scalability and inductive capability. To address these limitations, we propose a heterogeneous graph convolutional network (HeteGCN) modeling approach that unites the best aspects of PTE and TextGCN together. The main idea is to learn feature embeddings and derive document embeddings using a HeteGCN architecture with different graphs used across layers. We simplify TextGCN by dissecting into several HeteGCN models which (a) helps to study the usefulness of individual models and (b) offers flexibility in fusing learned embeddings from different models. In effect, the number of model parameters is reduced significantly, enabling faster training and improving performance in small labeled training set scenario. Our detailed experimental studies demonstrate the efficacy of the proposed approach.
Text Classification with Hugging Face Transformers in TensorFlow 2 (Without Tears)
The Hugging Face transformers package is an immensely popular Python library providing pretrained models that are extraordinarily useful for a variety of natural language processing (NLP) tasks. It previously supported only PyTorch, but, as of late 2019, TensorFlow 2 is supported as well. While the library can be used for many tasks from Natural Language Inference (NLI) to Question-Answering, text classification remains one of the most popular and practical use cases. The ktrain library is a lightweight wrapper for tf.keras in TensorFlow 2. It is designed to make deep learning and AI more accessible and easier to apply for beginners and domain experts. As of version 0.8, ktrain now includes a simplified interface to Hugging Face transformers for text classification.
Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Clusters for Extreme Multi-label Text Classification
Ye, Hui, Chen, Zhiyu, Wang, Da-Han, Davison, Brian D.
Extreme multi-label text classification (XMTC) is a task for tagging a given text with the most relevant labels from an extremely large label set. We propose a novel deep learning method called APLC-XLNet. Our approach fine-tunes the recently released generalized autoregressive pretrained model (XLNet) to learn a dense representation for the input text. We propose Adaptive Probabilistic Label Clusters (APLC) to approximate the cross entropy loss by exploiting the unbalanced label distribution to form clusters that explicitly reduce the computational time. Our experiments, carried out on five benchmark datasets, show that our approach has achieved new state-of-the-art results on four benchmark datasets. Our source code is available publicly at https://github.com/huiyegit/APLC_XLNet.
Text-based classification of interviews for mental health -- juxtaposing the state of the art
Currently, the state of the art for classification of psychiatric illness is based on audio-based classification. This thesis aims to design and evaluate a state of the art text classification network on this challenge. The hypothesis is that a well designed text-based approach poses a strong competition against the state-of-the-art audio based approaches. Dutch natural language models are being limited by the scarcity of pre-trained monolingual NLP models, as a result Dutch natural language models have a low capture of long range semantic dependencies over sentences. For this issue, this thesis presents belabBERT, a new Dutch language model extending the RoBERTa[15] architecture. belabBERT is trained on a large Dutch corpus (+32GB) of web crawled texts. After this thesis evaluates the strength of text-based classification, a brief exploration is done, extending the framework to a hybrid text- and audio-based classification. The goal of this hybrid framework is to show the principle of hybridisation with a very basic audio-classification network. The overall goal is to create the foundations for a hybrid psychiatric illness classification, by proving that the new text-based classification is already a strong stand-alone solution.
Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT
In this article, using NLP and Python, I will explain 3 different strategies for text multiclass classification: the old-fashioned Bag-of-Words (with Tf-Idf), the famous Word Embedding (with Word2Vec), and the cutting edge Language models (with BERT). NLP (Natural Language Processing) is the field of artificial intelligence that studies the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. NLP is often applied for classifying text data. Text classification is the problem of assigning categories to text data according to its content. There are different techniques to extract information from raw text data and use it to train a classification model.
SAFER: A Structure-free Approach for Certified Robustness to Adversarial Word Substitutions
Ye, Mao, Gong, Chengyue, Liu, Qiang
State-of-the-art NLP models can often be fooled by human-unaware transformations such as synonymous word substitution. For security reasons, it is of critical importance to develop models with certified robustness that can provably guarantee that the prediction is can not be altered by any possible synonymous word substitution. In this work, we propose a certified robust method based on a new randomized smoothing technique, which constructs a stochastic ensemble by applying random word substitutions on the input sentences, and leverage the statistical properties of the ensemble to provably certify the robustness. Our method is simple and structure-free in that it only requires the black-box queries of the model outputs, and hence can be applied to any pre-trained models (such as BERT) and any types of models (world-level or subword-level). Our method significantly outperforms recent state-of-the-art methods for certified robustness on both IMDB and Amazon text classification tasks. To the best of our knowledge, we are the first work to achieve certified robustness on large systems such as BERT with practically meaningful certified accuracy.
Comparing BERT against traditional machine learning text classification
González-Carvajal, Santiago, Garrido-Merchán, Eduardo C.
The BERT model has arisen as a popular state-of-the-art machine learning model in the recent years that is able to cope with multiple NLP tasks such as supervised text classification without human supervision. Its flexibility to cope with any type of corpus delivering great results has make this approach very popular not only in academia but also in the industry. Although, there are lots of different approaches that have been used throughout the years with success. In this work, we first present BERT and include a little review on classical NLP approaches. Then, we empirically test with a suite of experiments dealing different scenarios the behaviour of BERT against the traditional TF-IDF vocabulary fed to machine learning algorithms. Our purpose of this work is to add empirical evidence to support or refuse the use of BERT as a default on NLP tasks. Experiments show the superiority of BERT and its independence of features of the NLP problem such as the language of the text adding empirical evidence to use BERT as a default technique to be used in NLP problems.