Text Classification
Training and Prediction Data Discrepancies: Challenges of Text Classification with Noisy, Historical Data
Apostolova, Emilia, Kreek, R. Andrew
Industry datasets used for text classification are rarely created for that purpose. In most cases, the data and target predictions are a by-product of accumulated historical data, typically fraught with noise that is present both in the text-based documents and in the target labels. In this work, we address the question of how well performance metrics computed on noisy, historical data reflect the performance on the intended future machine learning model input. The results demonstrate the utility of dirty training datasets used to build prediction models for cleaner (and different) prediction inputs.
An Analysis of Hierarchical Text Classification Using Word Embeddings
Stein, Roger A., Jaques, Patricia A., Valiati, Joao F.
Efficient distributed numerical word representation models (word embeddings) combined with modern machine learning algorithms have recently yielded considerable improvement on automatic document classification tasks. However, the effectiveness of such techniques has not yet been assessed for hierarchical text classification (HTC). This study investigates the application of those models and algorithms to this specific problem by means of experimentation and analysis. We trained classification models with prominent machine learning algorithm implementations (fastText, XGBoost, SVM, and Keras' CNN) and notable word embedding generation methods (GloVe, word2vec, and fastText) on publicly available data and evaluated them with measures specifically appropriate for the hierarchical context. FastText achieved an ${}_{LCA}F_1$ of 0.893 on a single-labeled version of the RCV1 dataset. An analysis indicates that using word embeddings and their variants is a very promising approach for HTC.
Hierarchical CVAE for Fine-Grained Hate Speech Classification
Qian, Jing, ElSherief, Mai, Belding, Elizabeth, Wang, William Yang
Existing work on automated hate speech detection typically focuses on binary classification or on differentiating among a small set of categories. In this paper, we propose a novel method for a fine-grained hate speech classification task, which focuses on differentiating among 40 hate groups across 13 different hate group categories. We first explore the Conditional Variational Autoencoder (CVAE) (Larsen et al., 2016; Sohn et al., 2015) as a discriminative model and then extend it to a hierarchical architecture to utilize the additional hate category information for more accurate prediction. Experimentally, we show that incorporating the hate category information for training can significantly improve the classification performance and that our proposed model outperforms commonly used discriminative models.
Building a Robust Text Classifier on a Test-Time Budget
Parvez, Md Rizwan, Bolukbasi, Tolga, Chang, Kai-Wei, Saligrama, Venkatesh
In this paper, we study a generic learning framework for building a robust text classification model that achieves accuracy comparable to standard full models under test-time budget constraints. Our approach learns a selector to identify words that are relevant to the prediction tasks and only passes these words to the classifier for processing. The selector is trained jointly with the classifier and directly learns to work with the classifier. We further propose a data aggregation scheme to improve the robustness of the classifier. Our learning framework is general and can be incorporated with any type of text classification model. On real-world data, we show that the proposed approach improves the performance of a given classifier and speeds up the model with only a slight loss in accuracy.
Distance Based Source Domain Selection for Sentiment Classification
Schultz, Lex Razoux, Loog, Marco, Esfahani, Peyman Mohajerin
Automated sentiment classification (SC) on short text fragments has received increasing attention in recent years. Performing SC on unseen domains with few or no labeled samples can significantly affect the classification performance due to the different expression of sentiment in the source and target domains. In this study, we aim to mitigate this undesired impact by proposing a methodology based on a predictive measure, which allows us to select an optimal source domain from a set of candidates. The proposed measure is a linear combination of well-known distance functions between probability distributions supported on the source and target domains (e.g. Earth Mover's distance and Kullback-Leibler divergence). The performance of the proposed methodology is validated through an SC case study in which our numerical experiments suggest a significant improvement in the cross-domain classification error in comparison with a randomly selected source domain for both a naive and an adaptive learning setting. In the case of more heterogeneous datasets, the predictability feature of the proposed model can be utilized to further select a subset of candidate domains, where the corresponding classifier outperforms the one trained on all available source domains. This observation reinforces a hypothesis that our proposed model may also be deployed as a means to filter out redundant information during the training phase of SC.
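The idea of ranking candidate source domains by a weighted sum of distribution distances can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: it assumes smoothed unigram word distributions, swaps in total variation distance in place of Earth Mover's distance to stay dependency-free, and uses placeholder weights of 0.5 each (the abstract does not say how the weights are set). All function names are hypothetical.

```python
import math
from collections import Counter


def unigram_distribution(texts, vocab, alpha=1.0):
    """Add-alpha smoothed word distribution over a shared vocabulary."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); smoothing keeps q > 0."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def total_variation(p, q):
    """Total variation distance (assumed stand-in for Earth Mover's)."""
    return 0.5 * sum(abs(p[w] - q[w]) for w in p)


def combined_distance(source_texts, target_texts, weights=(0.5, 0.5)):
    """Linear combination of two distances; the weights are placeholders."""
    vocab = {tok for t in source_texts + target_texts for tok in t.lower().split()}
    p = unigram_distribution(target_texts, vocab)
    q = unigram_distribution(source_texts, vocab)
    return weights[0] * kl_divergence(p, q) + weights[1] * total_variation(p, q)


def select_source_domain(candidates, target_texts):
    """Return the name of the candidate domain closest to the target."""
    return min(candidates, key=lambda name: combined_distance(candidates[name], target_texts))


# Toy data, invented for illustration.
candidates = {
    "books": ["the plot was gripping", "a dull boring read"],
    "kitchen": ["the blender works great", "the kettle broke quickly"],
}
target = ["the toaster works great", "the mixer broke"]
print(select_source_domain(candidates, target))
```

On this toy data the kitchen reviews share most of their vocabulary with the target, so they end up with the smaller combined distance.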
From Random to Supervised: A Novel Dropout Mechanism Integrated with Global Information
Xu, Hengru, Li, Shen, Hu, Renfen, Li, Si, Gao, Sheng
Dropout is used to avoid overfitting by randomly dropping units from the neural networks during training. Inspired by dropout, this paper presents GI-Dropout, a novel dropout method integrating with global information to improve neural networks for text classification. Unlike the traditional dropout method in which the units are dropped randomly according to the same probability, we aim to use explicit instructions based on global information of the dataset to guide the training process. With GI-Dropout, the model is supposed to pay more attention to inapparent features or patterns. Experiments demonstrate the effectiveness of the dropout with global information on seven text classification tasks, including sentiment analysis and topic classification.
Text Classification with Deep Neural Network in TensorFlow -- Simple Explanation
Implementing text classification with TensorFlow can be simple. One area where text classification can be applied is chatbot text processing and intent resolution. In this post I will describe, step by step, how to build a TensorFlow model for text classification and how classification is done. Please refer to my previous post on a similar topic -- Contextual Chatbot with TensorFlow, Node.js and Oracle JET -- Steps How to Install and Get It Working. I also recommend going through this great post about chatbot implementation -- Contextual Chatbots with Tensorflow.
Ham or Spam? SMS Text Classification with Machine Learning
The use of mobile phones has skyrocketed in the last decade, leading to a new area for junk promotions from disreputable marketers. People innocently give out their mobile phone numbers while utilizing day-to-day services and are then flooded with spam promotional messages. In this post we will take a look at classifying SMS messages using the Naive Bayes Machine Learning model, understand why Naive Bayes works well for this use case, and also dive a little into wordclouds to visualize this dataset.
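As a rough illustration of how a Naive Bayes classifier separates ham from spam, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. It is not the code from the post: the toy messages, the whitespace tokenizer, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(messages, labels):
    """Count words per class and messages per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict(text, word_counts, class_counts, vocab):
    """Pick the class with the highest log prior plus log likelihood."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set, invented for illustration.
messages = [
    "win a free prize now",
    "free cash offer call now",
    "are we meeting for lunch today",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(messages, labels)

print(predict("free prize offer", *model))
print(predict("lunch at home", *model))
```

The "naive" part is the assumption that words are independent given the class, which is why the score is a simple sum of per-word log probabilities.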
Step 2.5: Choose a Model | ML Universal Guides | Google Developers
At this point, we have assembled our dataset and gained insights into the key characteristics of our data. Next, based on the metrics we gathered in Step 2, we should think about which classification model we should use. This means asking questions such as, "How do we present the text data to an algorithm that expects numeric input?" (this is called data preprocessing and vectorization), "What type of model should we use?", and "What configuration parameters should we use for our model?" Thanks to decades of research, we have access to a large array of data preprocessing and model configuration options. However, the availability of a very large array of viable options to choose from greatly increases the complexity and the scope of the particular problem at hand.
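To make the vectorization question concrete, here is a minimal bag-of-words sketch that turns text into fixed-length count vectors. It is only one of many possible preprocessing choices and is not taken from the guide; the whitespace tokenizer and the function names are assumptions for illustration.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index (sorted order)."""
    tokens = sorted({tok for t in texts for tok in t.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(text, vocab):
    """Return a list of token counts, one entry per vocabulary column."""
    counts = Counter(text.lower().split())
    # Dicts preserve insertion order, so columns follow the vocabulary indices.
    return [counts.get(tok, 0) for tok in vocab]


# Toy corpus, invented for illustration.
corpus = ["the movie was great", "the movie was bad"]
vocab = build_vocabulary(corpus)
print(vocab)
print(vectorize("the movie was great great", vocab))
```

Each text becomes a numeric vector of the same length, which is the form that most classification algorithms expect as input. Tokens outside the vocabulary are simply dropped.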
Projects In Machine Learning | NLP for Text Classification with NLTK & Scikit-learn | Eduonix
In this tutorial, we will cover Natural Language Processing for Text Classification with NLTK & Scikit-learn. Remember the last Natural Language Processing project we did? We will be using all that information to create a spam filter. This tutorial will also cover feature engineering and ensemble NLP in text classification. This project will use Jupyter Notebook running Python 2.7.