Text Classification


Text Classification Using BERT & Tensorflow

#artificialintelligence

In this article, we implement an email classification model that predicts whether a message is spam or ham (not spam) using BERT. Columns 2, 3, and 4 contain no important data and can be deleted. We also rename column v1 as "category" and v2 as "message". Here we have 4,825 ham and 747 spam emails. Finally, we create a new column, "spam", whose value is 1 if the message is spam and 0 if it is ham.
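
As a rough illustration of these preprocessing steps, here is a minimal pandas sketch. It assumes the familiar SMS/email spam CSV layout (columns v1, v2, plus three empty "Unnamed" columns); the file name and encoding are assumptions, not taken from the article.

    # Minimal preprocessing sketch; file name, encoding, and exact column
    # names are assumptions based on the common spam CSV layout.
    import pandas as pd

    df = pd.read_csv("spam.csv", encoding="latin-1")

    # Drop the three columns that carry no useful data.
    df = df.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])

    # Rename v1 -> category, v2 -> message.
    df = df.rename(columns={"v1": "category", "v2": "message"})

    # New binary column: 1 for spam, 0 for ham.
    df["spam"] = (df["category"] == "spam").astype(int)

    print(df["category"].value_counts())  # expect roughly 4825 ham / 747 spam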


Using Word Embeddings with TensorFlow for Movie Review Text Classification.

#artificialintelligence

This article shows how to use TensorFlow Embedding layers to implement movie review text classification. As we know, most machine learning algorithms cannot understand characters, words, or sentences; they can only take numbers as inputs. However, text data is unstructured and noisy by nature, which makes it impossible to feed machine learning models with raw text directly. There are many ways to convert text data into numerical features, and the process to follow depends on the feature engineering technique selected.
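
As a hedged sketch of this pipeline, the snippet below integer-encodes raw review strings with a TextVectorization layer and lets an Embedding layer learn dense vectors for them. The vocabulary size, sequence length, embedding dimension, and toy corpus are assumptions, not values from the article.

    # Strings -> integer ids -> learned dense vectors -> binary sentiment output.
    import tensorflow as tf

    max_tokens, seq_len, embed_dim = 10_000, 200, 16

    vectorize = tf.keras.layers.TextVectorization(
        max_tokens=max_tokens, output_sequence_length=seq_len)
    vectorize.adapt(["the movie was great", "a dull and noisy plot"])  # toy corpus

    model = tf.keras.Sequential([
        vectorize,                                          # strings -> integer ids
        tf.keras.layers.Embedding(max_tokens, embed_dim),   # ids -> dense vectors
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),     # positive / negative
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])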


Text Classification: Predicting 'Good' or 'Bad' Statements using Natural Language Processing

#artificialintelligence

This blog covers a very fundamental method of predicting whether a given input statement should be classified as 'Good' or 'Bad'. To do so, we first train a Natural Language Processing (NLP) model on a historical dataset. So, let's begin! You should be aware of the BOW (Bag of Words) approach; you may check [1] for more details.
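
For context, here is a minimal bag-of-words sketch along these lines; the toy sentences, labels, and the Naive Bayes classifier are illustrative assumptions, not taken from the blog.

    # Bag-of-words features plus a simple probabilistic classifier.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["great service, very helpful", "terrible and rude support"]
    labels = ["Good", "Bad"]

    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(texts, labels)
    print(clf.predict(["helpful and great"]))  # -> ['Good']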


Iterative Network Pruning with Uncertainty Regularization for Lifelong Sentiment Classification

arXiv.org Artificial Intelligence

Lifelong learning capabilities are crucial for sentiment classifiers to process continuous streams of opinionated information on the Web. However, performing lifelong learning is non-trivial for deep neural networks, as continual training on incrementally available information inevitably results in catastrophic forgetting or interference. In this paper, we propose a novel iterative network pruning with uncertainty regularization method for lifelong sentiment classification (IPRLS), which leverages the principles of network pruning and weight regularization. By performing network pruning with uncertainty regularization in an iterative manner, IPRLS can adapt a single BERT model to work with continuously arriving data from multiple domains while avoiding catastrophic forgetting and interference. Specifically, we leverage an iterative pruning method to remove redundant parameters in large deep networks so that the freed-up space can then be employed to learn new tasks, tackling the catastrophic forgetting problem. Instead of keeping the old tasks' weights fixed when learning new tasks, we also use an uncertainty regularization based on the Bayesian online learning framework to constrain the updates of old-task weights in BERT, which enables positive backward transfer, i.e., learning new tasks improves performance on past tasks while protecting old knowledge from being lost. In addition, we propose a task-specific low-dimensional residual function in parallel to each layer of BERT, which makes IPRLS less prone to losing the knowledge saved in the base BERT network when learning a new task. Extensive experiments on 16 popular review corpora demonstrate that the proposed IPRLS method significantly outperforms strong baselines for lifelong sentiment classification. For reproducibility, we submit the code and data at: https://github.com/siat-nlp/IPRLS.
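
The authors' IPRLS implementation is available at the link above. Purely to illustrate the basic mechanics of iterative magnitude pruning that the abstract builds on, here is a toy NumPy sketch; the pruning fraction and layer shape are arbitrary, and this is not the paper's method.

    # Toy illustration (not the authors' IPRLS code): zero out the smallest-magnitude
    # weights so the freed slots can later be reused to learn a new task.
    import numpy as np

    def prune_smallest(weights, fraction):
        """Return a 0/1 mask keeping roughly the largest-magnitude (1 - fraction) of weights."""
        threshold = np.quantile(np.abs(weights), fraction)
        return (np.abs(weights) > threshold).astype(weights.dtype)

    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 4))
    mask = prune_smallest(w, fraction=0.5)   # free up roughly half of the parameters
    w_pruned = w * mask                      # the zeroed entries are available for the next task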


Label Mask for Multi-Label Text Classification

arXiv.org Artificial Intelligence

One of the key problems in multi-label text classification is how to take advantage of the correlation among labels. However, it is very challenging to directly model the correlations among labels in a complex and unknown label space. In this paper, we propose a Label Mask multi-label text classification model (LM-MTC), which is inspired by the idea of cloze questions in language models. LM-MTC is able to capture implicit relationships among labels through the powerful ability of pre-trained language models. On this basis, we assign a different token to each potential label and randomly mask the token with a certain probability to build a label-based Masked Language Model (MLM). We train the MTC and MLM together, further improving the generalization ability of the model. Extensive experiments on multiple datasets demonstrate the effectiveness of our method.
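
As a toy sketch of the label-masking idea described above: give each label its own token, append those tokens to the input, and randomly replace some of them with [MASK] so the model must recover them, MLM-style. The label-token naming scheme and masking probability are assumptions; the paper's exact formulation may differ.

    # Toy label-masking sketch; token names and probabilities are assumptions.
    import random

    labels = ["sports", "politics", "tech"]
    label_tokens = {lab: f"[LABEL_{lab.upper()}]" for lab in labels}

    def mask_label_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
        """Randomly mask label tokens with probability mask_prob."""
        return [mask_token if t in label_tokens.values() and random.random() < mask_prob
                else t for t in tokens]

    sequence = ["the", "match", "ended", "late"] + list(label_tokens.values())
    print(mask_label_tokens(sequence, mask_prob=0.5))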


SSMix: Saliency-Based Span Mixup for Text Classification

arXiv.org Artificial Intelligence

Data augmentation with mixup has been shown to be effective on various computer vision tasks. Despite its great success, there has been a hurdle in applying mixup to NLP tasks, since text consists of discrete tokens of variable length. In this work, we propose SSMix, a novel mixup method where the operation is performed on the input text rather than on hidden vectors as in previous approaches. SSMix synthesizes a sentence while preserving the locality of the two original texts through span-based mixing, and keeps more of the tokens relevant to the prediction by relying on saliency information. With extensive experiments, we empirically validate that our method outperforms hidden-level mixup methods on a wide range of text classification benchmarks, including textual entailment, sentiment classification, and question-type classification. Our code is available at https://github.com/clovaai/ssmix.
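
The released SSMix code is linked above. To make the span-mixing idea concrete, here is a simplified illustration that splices a span from one tokenized text into another and mixes the labels in proportion to span length; the saliency scoring SSMix uses to choose spans is omitted, so this is not the paper's algorithm.

    # Simplified span-level mixing; span position and length are arbitrary here.
    def span_mixup(tokens_a, tokens_b, span_len=2, insert_at=1):
        span = tokens_b[:span_len]                       # span taken from text B
        mixed = tokens_a[:insert_at] + span + tokens_a[insert_at + span_len:]
        lam = 1 - span_len / max(len(mixed), 1)          # label weight for text A
        return mixed, lam

    mixed, lam = span_mixup(["the", "movie", "was", "great"],
                            ["a", "boring", "plot"])
    print(mixed, lam)  # ['the', 'a', 'boring', 'great'] 0.5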


Validating GAN-BioBERT: A Methodology For Assessing Reporting Trends In Clinical Trials

arXiv.org Machine Learning

In the past decade, there has been much discussion about the issue of biased reporting in clinical research. Despite this attention, few tools have been developed for the systematic assessment of qualitative statements made in clinical research, with most studies that assess qualitative statements relying on manual expert raters, which limits their scale. Previous attempts to develop larger-scale tools, such as those using natural language processing, were limited by both their accuracy and the number of categories used for classifying their findings. With these limitations in mind, this study's goal was to develop a classification algorithm that is both suitably accurate and finely grained enough to be applied at large scale for assessing the qualitative sentiment expressed in clinical trial abstracts. Additionally, this study seeks to compare the performance of the proposed algorithm, GAN-BioBERT, to previous studies as well as to expert manual rating of clinical trial abstracts. This study develops a three-class sentiment classification algorithm for clinical trial abstracts using a semi-supervised natural language processing model based on the Bidirectional Encoder Representations from Transformers (BERT) model, trained on a series of clinical trial abstracts annotated by a group of experts in academic medicine. Results: The algorithm achieved a classification accuracy of 91.3%, with a macro F1-score of 0.92, a significant improvement in accuracy compared to previous methods and expert ratings, while also making the sentiment classification finer-grained than in previous studies. The proposed algorithm, GAN-BioBERT, is a suitable classification model for the large-scale assessment of qualitative statements in the clinical trial literature, providing an accurate, reproducible tool for the large-scale study of clinical publication trends.
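
GAN-BioBERT itself is a semi-supervised, GAN-based model; purely as a point of reference, the sketch below sets up a plain supervised three-class BERT classifier with Hugging Face Transformers. The generic bert-base-uncased checkpoint, the example abstract, and the label names are assumptions and are not the authors' setup.

    # Plain supervised three-class baseline, not the GAN-based semi-supervised
    # GAN-BioBERT model; checkpoint and label names are illustrative assumptions.
    import tensorflow as tf
    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

    checkpoint = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

    abstract = "The treatment group showed significant improvement over placebo."
    inputs = tokenizer(abstract, return_tensors="tf", truncation=True)
    logits = model(**inputs).logits

    labels = ["negative", "neutral", "positive"]            # assumed class names
    print(labels[int(tf.argmax(logits, axis=-1)[0])])        # untrained head: output is arbitrary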


Out-of-Manifold Regularization in Contextual Embedding Space for Text Classification

arXiv.org Artificial Intelligence

Recent studies on neural networks with pre-trained weights (e.g., BERT) have mainly focused on a low-dimensional subspace, where the embedding vectors computed from input words (or their contexts) are located. In this work, we propose a new approach to finding and regularizing the remainder of the space, referred to as out-of-manifold, which cannot be accessed through the words. Specifically, we synthesize out-of-manifold embeddings based on two embeddings obtained from actually observed words and utilize them for fine-tuning the network. A discriminator is trained to detect whether an input embedding is located inside the manifold or not, and simultaneously, a generator is optimized to produce new embeddings that can be easily identified as out-of-manifold by the discriminator. These two modules successfully collaborate in a unified and end-to-end manner to regularize the out-of-manifold region. Our extensive evaluation on various text classification benchmarks demonstrates the effectiveness of our approach, as well as its good compatibility with existing data augmentation techniques that aim to enhance the manifold.
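
Purely as a toy illustration of synthesizing a new embedding from two observed ones: the paper parameterizes this synthesis with a trained generator, while the sketch below uses simple linear interpolation, so it should be read only as a conceptual stand-in. The embedding dimension and mixing coefficient are assumptions.

    # Toy embedding synthesis; the paper's generator/discriminator training is omitted.
    import numpy as np

    def synthesize(e1, e2, alpha=0.5):
        """Blend two contextual embeddings; such points may fall off the word manifold."""
        return alpha * e1 + (1.0 - alpha) * e2

    rng = np.random.default_rng(0)
    e_cat, e_dog = rng.normal(size=768), rng.normal(size=768)  # BERT-base hidden size
    e_mix = synthesize(e_cat, e_dog, alpha=0.3)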


MathBERT: A Pre-Trained Model for Mathematical Formula Understanding

arXiv.org Artificial Intelligence

Large-scale pre-trained models like BERT have achieved great success in various Natural Language Processing (NLP) tasks, while it remains a challenge to adapt them to math-related tasks. Current pre-trained models neglect the structural features of formulas and the semantic correspondence between a formula and its context. To address these issues, we propose a novel pre-trained model, MathBERT, which is jointly trained on mathematical formulas and their corresponding contexts. In addition, to further capture the semantic-level structural features of formulas, a new pre-training task is designed to predict masked formula substructures extracted from the Operator Tree (OPT), the semantic structural representation of formulas. We conduct experiments on three downstream tasks to evaluate the performance of MathBERT: mathematical information retrieval, formula topic classification, and formula headline generation. Experimental results demonstrate that MathBERT significantly outperforms existing methods on all three tasks. Moreover, we qualitatively show that this pre-trained model effectively captures the semantic-level structural information of formulas. To the best of our knowledge, MathBERT is the first pre-trained model for mathematical formula understanding.


Entailment as Few-Shot Learner

arXiv.org Artificial Intelligence

Large pre-trained language models (LMs) have demonstrated remarkable ability as few-shot learners. However, their success hinges largely on scaling model parameters to a degree that makes training and serving challenging. In this paper, we propose a new approach, named EFL, that can turn small LMs into better few-shot learners. The key idea of this approach is to reformulate potential NLP tasks as entailment tasks, and then fine-tune the model with as few as 8 examples. We further demonstrate that our proposed method can be: (i) naturally combined with an unsupervised contrastive-learning-based data augmentation method; (ii) easily extended to multilingual few-shot learning. A systematic evaluation on 18 standard NLP tasks demonstrates that this approach improves on various existing SOTA few-shot learning methods by 12%, and yields few-shot performance competitive with models 500 times larger, such as GPT-3.
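
To make the entailment reformulation concrete, here is a minimal sketch that turns a classification example into entailment pairs; the label descriptions and template wording are assumptions, and EFL then fine-tunes an entailment model on such pairs rather than stopping here.

    # Pair the input with one hypothesis per label; the gold label's pair is trained
    # as "entailment", all other pairs as "not entailment". Template is an assumption.
    def to_entailment_pairs(sentence, label_descriptions):
        return [(sentence, f"This example is {desc}.") for desc in label_descriptions]

    pairs = to_entailment_pairs("The movie was a delight from start to finish.",
                                ["a positive review", "a negative review"])
    for premise, hypothesis in pairs:
        print(premise, "=>", hypothesis)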