Text Classification
Orthogonal Matching Pursuit for Text Classification
Skianis, Konstantinos, Tziortziotis, Nikolaos, Vazirgiannis, Michalis
In text classification, the problem of overfitting arises due to the high dimensionality, making regularization essential. Although classic regularizers provide sparsity, they fail to return highly accurate models. On the contrary, state-of-the-art group-lasso regularizers provide better results at the expense of low sparsity. In this paper, we apply a greedy variable selection algorithm, called Orthogonal Matching Pursuit, for the text classification task. We also extend standard group OMP by introducing overlapping group OMP to handle overlapping groups of features. Empirical analysis verifies that both OMP and overlapping GOMP constitute powerful regularizers, able to produce effective and super-sparse models. Code and data are available here.
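For readers unfamiliar with Orthogonal Matching Pursuit, the greedy selection loop it refers to can be sketched as below. This is a minimal NumPy illustration on a least-squares regression target, not the authors' classification, group, or overlapping-group variants. The function name `omp` and the toy data are assumptions made for the example.

```python
import numpy as np

def omp(X, y, n_nonzero):
    """Greedy OMP: pick the column most correlated with the current
    residual, then refit least squares on all selected columns."""
    coef = np.zeros(X.shape[1])
    support = []
    residual = y.astype(float)
    for _ in range(n_nonzero):
        correlations = np.abs(X.T @ residual)
        if support:
            correlations[support] = -np.inf  # never select a column twice
        support.append(int(np.argmax(correlations)))
        # Orthogonal step: refit on the whole support, not just the new column
        sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ sol
    coef[support] = sol
    return coef, support

# Toy data (hypothetical): 20 features, only two of them generate y
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
X /= np.linalg.norm(X, axis=0)
true_coef = np.zeros(20)
true_coef[[3, 11]] = [2.0, -1.5]
y = X @ true_coef

coef, support = omp(X, y, n_nonzero=2)
print("selected features:", support)
```

The sparsity budget `n_nonzero` plays the role of the regularization strength: the model can never use more features than the number of greedy steps taken.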
Salesforce research
Deep learning has significantly improved state-of-the-art performance for natural language processing tasks like machine translation, summarization, question answering, and text classification. Each of these tasks is typically studied with a specific metric, and performance is often measured on a set of standard benchmark datasets. This has led to the development of architectures designed specifically for those tasks and metrics, but it does not necessarily promote the emergence of general NLP models, those which can perform well across a wide variety of NLP tasks. In order to explore the possibility of such models as well as the tradeoffs that arise in optimizing for them, we introduce the Natural Language Decathlon (decaNLP). The goal of the Decathlon is to explore models that generalize to all ten tasks and investigate how such models differ from those trained for single tasks.
Investigating Capsule Networks with Dynamic Routing for Text Classification
Zhao, Wei, Ye, Jianbo, Yang, Min, Lei, Zeyang, Zhang, Suofei, Zhao, Zhou
In this study, we explore capsule networks with dynamic routing for text classification. We propose three strategies to stabilize the dynamic routing process to alleviate the disturbance of some noise capsules which may contain "background" information or have not been successfully trained. A series of experiments are conducted with capsule networks on six text classification benchmarks. Capsule networks achieve competitive results over the strong baseline methods on 4 out of 6 datasets, which shows the effectiveness of capsule networks for text classification. We additionally show that capsule networks exhibit significant improvement over strong competitors when transferring from single-label to multi-label text classification. To the best of our knowledge, this is the first work in which capsule networks have been empirically investigated for text modeling.
Using J-K fold Cross Validation to Reduce Variance When Tuning NLP Models
Moss, Henry B., Leslie, David S., Rayson, Paul
K-fold cross validation (CV) is a popular method for estimating the true performance of machine learning models, allowing model selection and parameter tuning. However, the very process of CV requires random partitioning of the data and so our performance estimates are in fact stochastic, with variability that can be substantial for natural language processing tasks. We demonstrate that these unstable estimates cannot be relied upon for effective parameter tuning. The resulting tuned parameters are highly sensitive to how our data is partitioned, meaning that we often select sub-optimal parameter choices and have serious reproducibility issues. Instead, we propose to use the less variable J-K-fold CV, in which J independent K-fold cross validations are used to assess performance. Our main contributions are extending J-K-fold CV from performance estimation to parameter tuning and investigating how to choose J and K. We argue that variability is more important than bias for effective tuning and so advocate lower choices of K than are typically seen in the NLP literature, instead using the saved computation to increase J. To demonstrate the generality of our recommendations we investigate a wide range of case studies: sentiment classification (both general and target-specific), part-of-speech tagging and document classification.
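The J-K-fold scheme described above, J independent random K-fold partitions whose scores are averaged, can be sketched in plain Python as follows. The helper names `jk_fold_splits` and `jk_fold_score` and the placeholder scoring function are assumptions for illustration, not the authors' code.

```python
import random
from statistics import mean

def jk_fold_splits(n_samples, j, k, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for J independent
    random K-fold partitions of range(n_samples)."""
    rng = random.Random(seed)
    for repeat in range(j):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        folds = [indices[f::k] for f in range(k)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield repeat, f, train_idx, test_idx

def jk_fold_score(score_fn, n_samples, j, k, seed=0):
    """Average score_fn(train_idx, test_idx) over all J*K splits."""
    return mean(score_fn(tr, te)
                for _, _, tr, te in jk_fold_splits(n_samples, j, k, seed))

# Placeholder score (hypothetical): fraction of the data held out per split
splits = list(jk_fold_splits(n_samples=10, j=3, k=2))
held_out = jk_fold_score(lambda tr, te: len(te) / (len(tr) + len(te)),
                         n_samples=10, j=3, k=2)
print(len(splits), held_out)
```

In practice `score_fn` would train a model on `train_idx` and return its metric on `test_idx`. For scikit-learn users, `sklearn.model_selection.RepeatedKFold` (with `n_splits` as K and `n_repeats` as J) generates the same kind of splits.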
A Scalable Machine Learning Approach for Inferring Probabilistic US-LI-RADS Categorization
Banerjee, Imon, Choi, Hailey H., Desser, Terry, Rubin, Daniel L.
We propose a scalable computerized approach for large-scale inference of Liver Imaging Reporting and Data System (LI-RADS) final assessment categories in narrative ultrasound (US) reports. Although our model was trained on reports created using a LI-RADS template, it was also able to infer LI-RADS scoring for unstructured reports that were created before the LI-RADS guidelines were established. No human-labelled data was required in any step of this study; for training, LI-RADS scores were automatically extracted from those reports that contained structured LI-RADS scores, and the model then transferred the derived knowledge to reasoning over unstructured radiology reports. By providing automated LI-RADS categorization, our approach may enable standardizing screening recommendations and treatment planning of patients at risk for hepatocellular carcinoma, and it may facilitate AI-based healthcare research with US images by offering large-scale text mining and data gathering opportunities from standard hospital clinical data repositories.
Text Classification based on Word Subspace with Term-Frequency
Shimomoto, Erica K., Souza, Lincon S., Gatto, Bernardo B., Fukui, Kazuhiro
Text classification has become indispensable due to the rapid increase of text in digital form. Over the past three decades, efforts have been made to approach this task using various learning algorithms and statistical models based on bag-of-words (BOW) features. Despite their simple implementation, BOW features lack semantic meaning representation. To solve this problem, neural networks started to be employed to learn word vectors, such as word2vec. Word2vec embeds word semantic structure into vectors, where the angle between vectors indicates the meaningful similarity between words. To measure the similarity between texts, we propose the novel concept of word subspace, which can represent the intrinsic variability of features in a set of word vectors. Through this concept, it is possible to model text from word vectors while holding semantic information. To incorporate the word frequency directly in the subspace model, we further extend the word subspace to the term-frequency (TF) weighted word subspace. Based on these new concepts, text classification can be performed under the mutual subspace method (MSM) framework. The validity of our modeling is shown through experiments on the Reuters text database, comparing the results to various state-of-the-art algorithms.
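One common way to realize the ideas in this abstract is to take the leading principal directions of a text's word vectors (without mean-centering) as its subspace, and to compare two subspaces through their canonical angles. The sketch below follows that reading. The function names, the TF weighting via row scaling, the choice of mean squared cosine as the similarity, and the random stand-in vectors are all assumptions, and may differ from the authors' exact formulation.

```python
import numpy as np

def word_subspace(vectors, dim, weights=None):
    """Orthonormal basis (d x dim) for the main directions of a set of
    word vectors. vectors: (n_words, d); weights: optional term frequencies."""
    V = np.asarray(vectors, dtype=float)
    if weights is not None:
        # Scaling rows by sqrt(tf) weights the autocorrelation matrix V^T diag(tf) V
        V = V * np.sqrt(np.asarray(weights, dtype=float))[:, None]
    # Right singular vectors of V are the eigenvectors of V^T V (no centering)
    _, _, vt = np.linalg.svd(V, full_matrices=False)
    return vt[:dim].T

def subspace_similarity(basis_a, basis_b):
    """Mean squared cosine of the canonical angles between two subspaces."""
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.mean(cosines ** 2))

# Random vectors stand in for word2vec embeddings of two texts (hypothetical)
rng = np.random.default_rng(0)
text_a = rng.standard_normal((30, 8))
text_b = rng.standard_normal((25, 8))
tf_a = rng.integers(1, 5, size=30)

A = word_subspace(text_a, dim=3, weights=tf_a)
B = word_subspace(text_b, dim=3)
sim_self = subspace_similarity(A, A)
sim_ab = subspace_similarity(A, B)
print(sim_self, sim_ab)
```

Under a mutual-subspace-style classifier, a query text would be assigned to the class whose subspace gives the highest similarity.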
Semi-supervised and Transfer learning approaches for low resource sentiment classification
Gupta, Rahul, Sahu, Saurabh, Espy-Wilson, Carol, Narayanan, Shrikanth
Sentiment classification involves quantifying the affective reaction of a human to a document, media item or an event. Although researchers have investigated several methods to reliably infer sentiment from lexical, speech and body language cues, training a model with a small set of labeled datasets is still a challenge. For instance, in expanding sentiment analysis to new languages and cultures, it may not always be possible to obtain comprehensive labeled datasets. In this paper, we investigate the application of semi-supervised and transfer learning methods to improve performance on low resource sentiment classification tasks. We experiment with extracting dense feature representations, pre-training and manifold regularization in enhancing the performance of sentiment classification systems. Our goal is a coherent implementation of these methods, and we evaluate the gains they achieve in a matched setting, with training and testing on a single corpus, as well as in two cross-corpus settings. In both cases, our experiments demonstrate that the proposed methods can significantly enhance the model performance against a purely supervised approach, particularly in cases involving only a handful of training examples.
On the Importance of Attention in Meta-Learning for Few-Shot Text Classification
Jiang, Xiang, Havaei, Mohammad, Chartrand, Gabriel, Chouaib, Hassan, Vincent, Thomas, Jesson, Andrew, Chapados, Nicolas, Matwin, Stan
Current deep learning based text classification methods are limited by their ability to achieve fast learning and generalization when the data is scarce. We address this problem by integrating a meta-learning procedure that uses the knowledge learned across many tasks as an inductive bias towards better natural language understanding. Based on the Model-Agnostic Meta-Learning framework (MAML), we introduce the Attentive Task-Agnostic Meta-Learning (ATAML) algorithm for text classification. The essential difference between MAML and ATAML is in the separation of task-agnostic representation learning and task-specific attentive adaptation. The proposed ATAML is designed to encourage task-agnostic representation learning by way of task-agnostic parameterization and facilitate task-specific adaptation via attention mechanisms. We provide evidence to show that the attention mechanism in ATAML has a synergistic effect on learning performance. In comparisons with models trained from random initialization, pretrained models and meta trained MAML, our proposed ATAML method generalizes better on single-label and multi-label classification tasks in miniRCV1 and miniReuters-21578 datasets.
Intentional Control of Type I Error over Unconscious Data Distortion: a Neyman-Pearson Approach to Text Classification
Xia, Lucy, Zhao, Richard, Wu, Yanhui, Tong, Xin
Digital texts have become an increasingly important source of data for social studies. However, textual data from open platforms are vulnerable to manipulation (e.g., censorship and information inflation), often leading to bias in subsequent empirical analysis. This paper investigates the problem of data distortion in text classification when controlling type I error (a relevant textual message is classified as irrelevant) is the priority. The default classical classification paradigm that minimizes the overall classification error can yield an undesirably large type I error, and data distortion exacerbates this situation. As a solution, we propose the Neyman-Pearson (NP) classification paradigm which minimizes type II error under a user-specified type I error constraint. Theoretically, we show that while the classical oracle (i.e., optimal classifier) cannot be recovered under unknown data distortion even if one has the entire post-distortion population, the NP oracle is unaffected by data distortion and can be recovered under the same condition. Empirically, we illustrate the advantage of NP classification methods in a case study that classifies posts about strikes and corruption published on a leading Chinese blogging platform.
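The core idea of Neyman-Pearson classification, fixing an upper bound on type I error and then minimizing type II error, can be illustrated by choosing a decision threshold on classifier scores. The sketch below is a simplified empirical version with no high-probability guarantee and is not the authors' procedure. The function names, the score convention (higher score means "irrelevant"), and the toy scores are assumptions.

```python
import math

def np_threshold(class0_scores, alpha):
    """Smallest observed threshold t such that at most a fraction alpha of
    class-0 (relevant) scores lie strictly above t."""
    ordered = sorted(class0_scores)
    n = len(ordered)
    # Number of class-0 scores allowed above t; keep at least one at or below
    allowed = min(math.floor(alpha * n), n - 1)
    return ordered[n - allowed - 1]

def error_rates(class0_scores, class1_scores, t):
    """Predict 'irrelevant' (class 1) when score > t.
    Type I: relevant predicted irrelevant. Type II: irrelevant predicted relevant."""
    type1 = sum(s > t for s in class0_scores) / len(class0_scores)
    type2 = sum(s <= t for s in class1_scores) / len(class1_scores)
    return type1, type2

# Toy scores (hypothetical): higher means the classifier leans "irrelevant"
relevant = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.9]
irrelevant = [0.3, 0.5, 0.7, 0.8, 0.95]

t = np_threshold(relevant, alpha=0.2)
type1, type2 = error_rates(relevant, irrelevant, t)
print(t, type1, type2)
```

Picking the lowest threshold that satisfies the type I constraint keeps type II error as small as the constraint allows on these scores. A threshold chosen on the same data it is evaluated on will look optimistic, so a held-out class-0 sample would normally be used.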
r/MachineLearning - [D] Text classification on a small dataset
I am trying to perform multiclass text classification (for 24 classes) on a set of documents, but I have a very small dataset currently (1200 total examples). The data collection process is a bit tedious in my case, hence the small dataset size. The best result I have achieved till now is 58% accuracy with an SVM model and a single layer CNN model. Is there any other approach I can try other than collecting more data? I have tried oversampling the training set, but it didn't seem to improve the performance.