Text Classification
Using AI to classify a book
We are going to work on a specific sub-task of NLP called text classification: the process of recognizing a pattern in a text and assigning it a label. Examples you encounter in day-to-day life, often without noticing, include spam detection (in your mailbox), sentiment analysis (when you review a product or leave a comment) and tagging customer queries (when you fill in a contact form on a website). What we will try to do is classify science-fiction books into subgenres (dystopia, cyberpunk, space opera, …) based on their plot. In the end, we want a model that takes a book plot as input and outputs the subgenres detected in the text, along with the model's confidence in each detection. The demonstrator can take up to 1 minute to open because I host my app on a free version of Heroku, so it goes to sleep when nobody uses it, which is also better for the planet! This kind of algorithm could help an online marketplace classify the books it receives in order to make better recommendations, or help a librarian organize books in an original way, by subgenre instead of alphabetically, to create an experience in the library. Data is one of the most important things (if not the most important) in data science.
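As a rough illustration of the output described above (subgenres plus a confidence for each), here is a minimal sketch. It assumes a model has already produced one raw score per subgenre; the scores, the sigmoid mapping and the 0.5 threshold are all hypothetical choices made for this example, not details of the author's app.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(x: float) -> float:
    """Map a raw score to a confidence between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def detect_subgenres(
    scores: Dict[str, float], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Return (subgenre, confidence) pairs whose confidence reaches the
    threshold, sorted from most to least confident.

    Each subgenre is scored independently, so a plot can receive
    several labels at once (multi-label classification).
    """
    confidences = {label: sigmoid(score) for label, score in scores.items()}
    detected = [(label, conf) for label, conf in confidences.items() if conf >= threshold]
    return sorted(detected, key=lambda pair: pair[1], reverse=True)


# Hypothetical raw scores for one book plot (assumed values, not real model output).
example_scores = {"dystopia": 2.0, "cyberpunk": 0.3, "space opera": -1.5}
result = detect_subgenres(example_scores)
for label, conf in result:
    print(f"{label}: {conf:.2f}")
```

Scoring each subgenre independently, rather than applying a softmax across all of them, is what lets a single plot be tagged as both dystopia and cyberpunk.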
An Attention Ensemble Approach for Efficient Text Classification of Indian Languages
Kulkarni, Atharva, Hengle, Amey, Udyawar, Rutuja
The recent surge of complex attention-based deep learning architectures has led to extraordinary results in various downstream NLP tasks in the English language. However, such research for resource-constrained and morphologically rich Indian vernacular languages has been relatively limited. This paper proffers team SPPU_AKAH's solution for the TechDOfication 2020 subtask-1f: which focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language. Availing the large dataset at hand, a hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification. Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy of 89.57% and f1-score of 0.8875. Furthermore, the solution resulted in the best system submission for this subtask, giving a test accuracy of 64.26% and f1-score of 0.6157, transcending the performances of other teams as well as the baseline system given by the organizers of the shared task.
Adv-OLM: Generating Textual Adversaries via OLM
Malik, Vijit, Bhat, Ashwani, Modi, Ashutosh
Deep learning models are susceptible to adversarial examples that have imperceptible perturbations in the original input, resulting in adversarial attacks against these models. Analysis of these attacks on the state of the art transformers in NLP can help improve the robustness of these models against such adversarial inputs. In this paper, we present Adv-OLM, a black-box attack method that adapts the idea of Occlusion and Language Models (OLM) to the current state of the art attack methods. OLM is used to rank words of a sentence, which are later substituted using word replacement strategies. We experimentally show that our approach outperforms other attack methods for several text classification tasks.
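To make the word-ranking step more concrete, the sketch below ranks the words of a sentence by how much a classifier's score changes when each word is removed. This is plain deletion-based occlusion, a simpler technique swapped in for illustration; it is not the OLM procedure from the paper. The scoring function and its word weights are hypothetical stand-ins for a real model.

```python
from typing import Callable, List, Tuple

# Hypothetical word weights standing in for a real sentiment classifier.
TOY_WEIGHTS = {"great": 2.0, "good": 1.0, "boring": -2.0}


def toy_score(words: List[str]) -> float:
    """Toy 'positive sentiment' score: the sum of the known word weights."""
    return sum(TOY_WEIGHTS.get(word, 0.0) for word in words)


def rank_by_occlusion(
    words: List[str], score_fn: Callable[[List[str]], float]
) -> List[Tuple[str, float]]:
    """Rank words by the absolute change in score when each one is deleted."""
    base = score_fn(words)
    importances = []
    for i, word in enumerate(words):
        occluded = words[:i] + words[i + 1:]  # sentence without word i
        importances.append((word, abs(base - score_fn(occluded))))
    # Most influential words first; the sort is stable, so ties keep sentence order.
    return sorted(importances, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_occlusion("a great and good film".split(), toy_score)
print(ranking)
```

In an attack setting, the top-ranked words would be the first candidates for substitution, since changing them is most likely to shift the prediction.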
Classification of Pedagogical content using conventional machine learning and deep learning model
Apuk, Vedat, Nuçi, Krenare Pireva
Billions of users create a large amount of data every day, coming from many types of sources. This data is in most cases unorganized and unclassified and is presented in various formats such as text, video, audio, or images. Processing and analyzing this data is a major challenge that we face every day. The problem of unstructured and unorganized text dates back to ancient times, but Text Classification as a discipline first appeared in the early 60s, and some 30 years later interest in it grew across various fields [1]. It began to be applied in many domains and applications, such as movie reviews [2], document classification [3], ecommerce [4], social media [5], online courses [6, 7], etc. As interest grew in the following years, its applications began solving problems with more accurate results and in more flexible ways. Knowledge Engineering (KE) was one of the applications of text classification in the late 80s, where rules for assigning a document to a particular category were defined manually based on expert knowledge [1]. After this time, there was a great wave of modern and advanced methods for text classification, most notably machine learning techniques, which improved the discipline and made it more interesting for scientists and researchers. These techniques bring many advantages: they are now very numerous and offer solutions to almost every problem we may encounter. The need for education and learning dates back to ancient times, and people are constantly improving and trying to gain as much knowledge as possible.
Explain and Predict, and then Predict again
Zhang, Zijian, Rudra, Koustav, Anand, Avishek
A desirable property of learning systems is to be both effective and interpretable. Towards this goal, recent models have been proposed that first generate an extractive explanation from the input text and then generate a prediction on just the explanation called explain-then-predict models. These models primarily consider the task input as a supervision signal in learning an extractive explanation and do not effectively integrate rationales data as an additional inductive bias to improve task performance. We propose a novel yet simple approach ExPred, that uses multi-task learning in the explanation generation phase effectively trading-off explanation and prediction losses. And then we use another prediction network on just the extracted explanations for optimizing the task performance. We conduct an extensive evaluation of our approach on three diverse language datasets -- fact verification, sentiment classification, and QA -- and find that we substantially outperform existing approaches.
Enhanced Twitter Sentiment Classification Using Contextual Information
Vosoughi, Soroush, Zhou, Helen, Roy, Deb
The rise in popularity and ubiquity of Twitter has made sentiment analysis of tweets an important and well-covered area of research. However, the 140 character limit imposed on tweets makes it hard to use standard linguistic methods for sentiment classification. On the other hand, what tweets lack in structure they make up with sheer volume and rich metadata. This metadata includes geolocation, temporal and author information. We hypothesize that sentiment is dependent on all these contextual factors. Different locations, times and authors have different emotional valences. In this paper, we explored this hypothesis by utilizing distant supervision to collect millions of labelled tweets from different locations, times and authors. We used this data to analyse the variation of tweet sentiments across different authors, times and locations. Once we explored and understood the relationship between these variables and sentiment, we used a Bayesian approach to combine these variables with more standard linguistic features such as n-grams to create a Twitter sentiment classifier. This combined classifier outperforms the purely linguistic classifier, showing that integrating the rich contextual information available on Twitter into sentiment classification is a promising direction of research.
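One simple way to picture how a Bayesian combination of linguistic and contextual evidence can work is sketched below. It assumes the text and the context are conditionally independent given the sentiment, so that P(s | text, context) is proportional to P(s | text) * P(s | context) / P(s). This independence assumption, the function name and all the probabilities are hypothetical; the paper's actual model may differ.

```python
from typing import Dict


def combine_posteriors(
    p_text: Dict[str, float],
    p_context: Dict[str, float],
    prior: Dict[str, float],
) -> Dict[str, float]:
    """Combine two per-class posteriors under a conditional-independence
    assumption: P(s | text, ctx) is proportional to P(s | text) * P(s | ctx) / P(s)."""
    unnormalized = {s: p_text[s] * p_context[s] / prior[s] for s in prior}
    total = sum(unnormalized.values())
    return {s: value / total for s, value in unnormalized.items()}


# Hypothetical numbers: a linguistic classifier and a context-only
# estimate (e.g. from location, time and author), with a uniform prior.
p_text = {"positive": 0.6, "negative": 0.4}
p_context = {"positive": 0.7, "negative": 0.3}
prior = {"positive": 0.5, "negative": 0.5}

combined = combine_posteriors(p_text, p_context, prior)
print(combined)
```

When both sources lean the same way, the combined estimate is more confident than either one alone, which is the intuition behind adding contextual variables to the n-gram features.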
Explaining NLP Models via Minimal Contrastive Editing (MiCE)
Ross, Alexis, Marasović, Ana, Peters, Matthew E.
Humans give contrastive explanations that explain why an observed event happened rather than some other counterfactual event (the contrast case). Despite the important role that contrastivity plays in how people generate and evaluate explanations, this property is largely missing from current methods for explaining NLP models. We present Minimal Contrastive Editing (MiCE), a method for generating contrastive explanations of model predictions in the form of edits to inputs that change model outputs to the contrast case. Our experiments across three tasks -- binary sentiment classification, topic classification, and multiple-choice question answering -- show that MiCE is able to produce edits that are not only contrastive, but also minimal and fluent, consistent with human contrastive edits. We demonstrate how MiCE edits can be used for two use cases in NLP system development -- uncovering dataset artifacts and debugging incorrect model predictions -- and thereby illustrate that generating contrastive explanations is a promising research direction for model interpretability.
Explaining Black-box Models for Biomedical Text Classification
Moradi, Milad, Samwald, Matthias
In this paper, we propose a novel method named Biomedical Confident Itemsets Explanation (BioCIE), aiming at post-hoc explanation of black-box machine learning models for biomedical text classification. Using sources of domain knowledge and a confident itemset mining method, BioCIE discretizes the decision space of a black-box into smaller subspaces and extracts semantic relationships between the input text and class labels in different subspaces. Confident itemsets discover how biomedical concepts are related to class labels in the black-box's decision space. BioCIE uses the itemsets to approximate the black-box's behavior for individual predictions. Optimizing fidelity, interpretability, and coverage measures, BioCIE produces class-wise explanations that represent decision boundaries of the black-box. Results of evaluations on various biomedical text classification tasks and black-box models demonstrated that BioCIE can outperform perturbation-based and decision set methods in terms of producing concise, accurate, and interpretable explanations. BioCIE improved the fidelity of instance-wise and class-wise explanations by 11.6% and 7.5%, respectively. It also improved the interpretability of explanations by 8%. BioCIE can be effectively used to explain how a black-box biomedical text classification model semantically relates input texts to class labels. The source code and supplementary material are available at https://github.com/mmoradi-iut/BioCIE.
Natural Language Processing Text Classification
We classify text data from a data source consisting of movie reviews. Text data must be preprocessed before machine learning techniques can be applied to it. We labeled each movie review as positive or negative, assigning 1 if the rating is greater than 7 and 0 if the rating is less than 4. Some of the data is unlabeled, and I did not include it in my analysis. The text_train variable is a list of length 25000; I have printed the reviews that have positive ratings (1).
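The labeling rule above can be sketched as a small function. The thresholds come from the text; the function names and the sample reviews are hypothetical, and reviews whose rating falls between the two thresholds are treated here as unlabeled and dropped.

```python
from typing import List, Optional, Tuple


def rating_to_label(rating: int) -> Optional[int]:
    """Apply the rule from the text: 1 if rating > 7, 0 if rating < 4,
    and None (unlabeled) for anything in between."""
    if rating > 7:
        return 1
    if rating < 4:
        return 0
    return None


def build_labeled_set(reviews: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Keep only the reviews that receive a label, paired with that label."""
    labeled = []
    for text, rating in reviews:
        label = rating_to_label(rating)
        if label is not None:
            labeled.append((text, label))
    return labeled


# Hypothetical sample data: (review text, rating).
sample_reviews = [("Loved it", 9), ("Terrible", 2), ("It was fine", 5)]
labeled_reviews = build_labeled_set(sample_reviews)
print(labeled_reviews)
```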
Label Confusion Learning to Enhance Text Classification Models
Guo, Biyang, Han, Songqiao, Han, Xiao, Huang, Hailiang, Lu, Ting
Representing a true label as a one-hot vector is a common practice in training text classification models. However, the one-hot representation may not adequately reflect the relation between the instances and labels, as labels are often not completely independent and instances may relate to multiple labels in practice. The inadequate one-hot representations tend to train the model to be over-confident, which may result in arbitrary prediction and model overfitting, especially for confused datasets (datasets with very similar labels) or noisy datasets (datasets with labeling errors). While training models with label smoothing (LS) can ease this problem to some degree, it still fails to capture the realistic relation among labels. In this paper, we propose a novel Label Confusion Model (LCM) as an enhancement component to current popular text classification models. LCM can learn label confusion to capture semantic overlap among labels by calculating the similarity between instances and labels during training and generate a better label distribution to replace the original one-hot label vector, thus improving the final classification performance. Extensive experiments on five text classification benchmark datasets reveal the effectiveness of LCM for several widely used deep learning classification models. Further experiments also verify that LCM is especially helpful for confused or noisy datasets and superior to the label smoothing method.
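For readers unfamiliar with the baseline mentioned above, the following minimal sketch contrasts a one-hot target with its label-smoothed version, using the common uniform formulation y_smooth = (1 - epsilon) * y + epsilon / K. It illustrates label smoothing only, not the LCM itself; the function names and the epsilon value are assumptions made for the example.

```python
from typing import List


def one_hot(index: int, num_classes: int) -> List[float]:
    """Build a one-hot target vector with a 1.0 at the true class index."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def smooth_labels(target: List[float], epsilon: float = 0.1) -> List[float]:
    """Uniform label smoothing: shrink the target toward the uniform
    distribution over the K classes by a factor epsilon."""
    num_classes = len(target)
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in target]


hard_target = one_hot(1, 4)
soft_target = smooth_labels(hard_target, epsilon=0.1)
print(hard_target)
print(soft_target)
```

Note that every incorrect class receives the same small mass regardless of how similar it is to the true label, which is the limitation the abstract points to.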