"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.
I chose to use the AG news benchmark dataset. I recuperated the training and test test from John Snow Labs (a must see reference for all things NLP). This dataset is divided into four balanced categories for a total of 120,000 rows as seen below. The dataset is formatted into 2 columns, category and description. Because I want this to be a succinct post, I will refer you to my previous article to find out how to use Spark NLP in Colab.
We introduce a grey-box adversarial attack and defence framework for sentiment classification. We address the issues of differentiability, label preservation and input reconstruction for adversarial attack and defence in one unified framework. Our results show that once trained, the attacking model is capable of generating high-quality adversarial examples substantially faster (one order of magnitude less in time) than state-of-the-art attacking methods. These examples also preserve the original sentiment according to human evaluation. Additionally, our framework produces an improved classifier that is robust in defending against multiple adversarial attacking methods. Code is available at: https://github.com/ibm-aur-nlp/adv-def-text-dist.
In this part, we will try to understand the Encoder-Decoder architecture of the Multi-Head Self-Attention Transformer network with some code in PyTorch. There won't be any theory involved(better theoretical version can be found here) just the barebones of the network and how can one write this network on its own in PyTorch. The architecture comprising the Transformer model is divided into two parts -- the Encoder part and the Decoder part. Several other things combine to form the Encoder and Decoder parts. Let's start with the Encoder.
In NLP, text classification is one of the primary problems we try to solve and its uses in language analyses are indisputable. The lack of labeled training data made it harder to do these tasks in low resource languages like Amharic. The task of collecting, labeling, annotating, and making valuable this kind of data will encourage junior researchers, schools, and machine learning practitioners to implement existing classification models in their language. In this short paper, we aim to introduce the Amharic text classification dataset that consists of more than 50k news articles that were categorized into 6 classes. This dataset is made available with easy baseline performances to encourage studies and better performance experiments.
We are going to work on a specific sub-task of NLP called text classification, this is the process of recognizing a pattern in a text and assign it a label. Examples that are used in your day to day life without you even noticing it include spam detection (in your mailbox), sentiment analysis (when you review a product or leave a comment) and tagging customer queries (when you fill in a contact form on a website). What we will try to do is to classify science-fiction books into different subgenres (dystopia, cyberpunk, space opera, …) based on their plot. In the end, we want a model that is able to take a book plot as an input and output the subgenres detected in the text and the confidence of the model that a subgenre is detected. The demonstrator can take up to 1 minute to open because I use a free version of Heroku to host my app, thus it goes to sleep when nobody uses it and it's better for the planet! This kind of algorithms could help an online market place to classify the books they receive to make more performant recommendations or a librarian to organize originally the books by subgenres instead of alphabetically, to create an experience in the library. Data is one of the most important (if not the most important) thing in data science.
The recent surge of complex attention-based deep learning architectures has led to extraordinary results in various downstream NLP tasks in the English language. However, such research for resource-constrained and morphologically rich Indian vernacular languages has been relatively limited. This paper proffers team SPPU\_AKAH's solution for the TechDOfication 2020 subtask-1f: which focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language. Availing the large dataset at hand, a hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification. Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy of 89.57\% and f1-score of 0.8875. Furthermore, the solution resulted in the best system submission for this subtask, giving a test accuracy of 64.26\% and f1-score of 0.6157, transcending the performances of other teams as well as the baseline system given by the organizers of the shared task.
Deep learning models are susceptible to adversarial examples that have imperceptible perturbations in the original input, resulting in adversarial attacks against these models. Analysis of these attacks on the state of the art transformers in NLP can help improve the robustness of these models against such adversarial inputs. In this paper, we present Adv-OLM, a black-box attack method that adapts the idea of Occlusion and Language Models (OLM) to the current state of the art attack methods. OLM is used to rank words of a sentence, which are later substituted using word replacement strategies. We experimentally show that our approach outperforms other attack methods for several text classification tasks.
Billions of users create a large amount of data every day, which in a sense comes from various types of sources. This data is in most cases unorganized and unclassified and is presented in various formats such as text, video, audio, or images. Processing and analyzing this data is a major challenge that we face every day. The problem of unstructured and unorganized text dates back to ancient times, but Text Classification as a discipline first appeared in the early 60s, where 30 years later the interest in various spheres for it increased , and began to be applied in various types of domains and applications such as for movie review , document classification , ecommerce , social media , online courses [6, 7], etc. As interest has grown more in the upcoming years, the uses start solving the problems with higher accurate results in more flexible ways. Knowledge Engineering (KE) was one of the applications of text classification in the late 80s, where the process took place by manually defining rules based on expert knowledge in terms of categorization of the document for a particular category . After this time, there was a great wave of use of various modern and advanced methods for text classification, which all improved this discipline and made it more interesting for scientists and researchers, more specifically the use of machine learning techniques. These techniques bring a lot of advantages, as they are now in very large numbers, where they provide solutions to almost every problem we may encounter. The need for education and learning dates back to ancient times, where people are constantly improving and trying to gain as much knowledge as possible.
A desirable property of learning systems is to be both effective and interpretable. Towards this goal, recent models have been proposed that first generate an extractive explanation from the input text and then generate a prediction on just the explanation called explain-then-predict models. These models primarily consider the task input as a supervision signal in learning an extractive explanation and do not effectively integrate rationales data as an additional inductive bias to improve task performance. We propose a novel yet simple approach ExPred, that uses multi-task learning in the explanation generation phase effectively trading-off explanation and prediction losses. And then we use another prediction network on just the extracted explanations for optimizing the task performance. We conduct an extensive evaluation of our approach on three diverse language datasets -- fact verification, sentiment classification, and QA -- and find that we substantially outperform existing approaches.
The rise in popularity and ubiquity of Twitter has made sentiment analysis of tweets an important and well-covered area of research. However, the 140 character limit imposed on tweets makes it hard to use standard linguistic methods for sentiment classification. On the other hand, what tweets lack in structure they make up with sheer volume and rich metadata. This metadata includes geolocation, temporal and author information. We hypothesize that sentiment is dependent on all these contextual factors. Different locations, times and authors have different emotional valences. In this paper, we explored this hypothesis by utilizing distant supervision to collect millions of labelled tweets from different locations, times and authors. We used this data to analyse the variation of tweet sentiments across different authors, times and locations. Once we explored and understood the relationship between these variables and sentiment, we used a Bayesian approach to combine these variables with more standard linguistic features such as n-grams to create a Twitter sentiment classifier. This combined classifier outperforms the purely linguistic classifier, showing that integrating the rich contextual information available on Twitter into sentiment classification is a promising direction of research.