Text Classification
Weakly Supervised Attention Networks for Fine-Grained Opinion Mining and Public Health
Karamanolakis, Giannis, Hsu, Daniel, Gravano, Luis
In many review classification applications, a fine-grained analysis of the reviews is desirable, because different segments (e.g., sentences) of a review may focus on different aspects of the entity in question. However, training supervised models for segment-level classification requires segment labels, which may be more difficult or expensive to obtain than review labels. In this paper, we employ Multiple Instance Learning (MIL) and use only weak supervision in the form of a single label per review. First, we show that when inappropriate MIL aggregation functions are used, then MIL-based networks are outperformed by simpler baselines. Second, we propose a new aggregation function based on the sigmoid attention mechanism and show that our proposed model outperforms the state-of-the-art models for segment-level sentiment classification (by up to 9.8% in F1). Finally, we highlight the importance of fine-grained predictions in an important public-health application: finding actionable reports of foodborne illness. We show that our model achieves 48.6% higher recall compared to previous models, thus increasing the chance of identifying previously unknown foodborne outbreaks.
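As a rough illustration of attention-based MIL aggregation, the sketch below combines per-segment class probabilities into one review-level prediction using sigmoid attention weights. The function names, the input layout, and the normalization by the sum of weights are assumptions made for this example; the abstract does not specify the exact formulation.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def aggregate_review(segment_probs, attention_scores):
    """Combine segment-level class probabilities into a review-level prediction.

    segment_probs: list of per-segment probability vectors (one per segment).
    attention_scores: one raw (unnormalized) attention score per segment.
    Each segment gets an independent weight in (0, 1) via the sigmoid, so
    several segments can be "important" at once, unlike with a softmax.
    Dividing by the weight sum is an assumption of this sketch.
    """
    weights = [sigmoid(s) for s in attention_scores]
    total = sum(weights)
    if total == 0.0:
        # Extreme negative scores can underflow to zero; fall back to a plain average.
        weights = [1.0] * len(segment_probs)
        total = float(len(segment_probs))
    n_classes = len(segment_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, segment_probs)) / total
        for c in range(n_classes)
    ]

# Hypothetical two-segment review with binary sentiment probabilities.
review = aggregate_review([[0.9, 0.1], [0.2, 0.8]], [0.0, 0.0])
```

With equal attention scores the result is the plain average of the segment predictions; raising one segment's score shifts the review-level prediction toward that segment.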
r/MachineLearning - [R] Enriching BERT with Knowledge Graph Embeddings for Document Classification
In this paper, we focus on the classification of books using short descriptive texts (cover blurbs) and additional metadata. Building upon BERT, a deep neural language model, we demonstrate how to combine text representations with metadata and knowledge graph embeddings, which encode author information. Compared to the standard BERT approach we achieve considerably better results for the classification task. For a more coarse-grained classification using eight labels we achieve an F1-score of 87.20, while a detailed classification using 343 labels yields an F1-score of 64.70. We make the source code and trained models of our experiments publicly available.
Dual Adversarial Co-Learning for Multi-Domain Text Classification
With the advent of deep learning, the performance of text classification models has improved significantly. Nevertheless, the successful training of a good classification model requires a sufficient amount of labeled data, while it is always expensive and time-consuming to annotate data. With the rapid growth of digital data, similar classification tasks can typically occur in multiple domains, while the availability of labeled data can vary widely across domains. Some domains may have abundant labeled data, while in other domains there may exist only a limited amount (or none) of labeled data. Meanwhile, text classification tasks are highly domain-dependent: a text classifier trained in one domain may not perform well in another domain. In order to address these issues, in this paper we propose a novel dual adversarial co-learning approach for multi-domain text classification (MDTC). The approach learns shared-private networks for feature extraction and deploys dual adversarial regularizations to align features across different domains and between labeled and unlabeled data simultaneously under a discrepancy-based co-learning framework, aiming to improve the classifiers' generalization capacity with the learned features. We conduct experiments on multi-domain sentiment classification datasets. The results show the proposed approach achieves state-of-the-art MDTC performance.
Authorship Analysis as a Text Classification or Clustering Problem
Many such 'literary' quandaries are inspected by expert linguists, as analysing and categorising discourses is fairly complex, domain-specific and highly multi-dimensional. One of the latest research areas in Natural Language Processing is Authorship Analysis, which tries to leverage the computational power of big data and artificial intelligence, combined with linguistics and cognitive psychology, for automatic classification of texts, identification of author profiles and resolution of authorship conflicts. This article is an attempt to introduce the concept of authorship analysis, its application areas and the major sub-tasks associated with it. The art and science of discriminating between writing styles of authors by identifying the characteristics of the persona of the authors and examining articles authored by them is called Authorship Analysis. Consequently, it also aims to determine biographic characteristics of an individual, such as age, gender, native language and cognitive psychological traits, based on "available information" pertaining to that individual. In this article, "available information" refers to textual data only in the context of authorship analysis; however, information in this context could go beyond textual format, as it might also involve multi-modal observations.
Spam filtering on forums: A synthetic oversampling based approach for imbalanced data classification
Ratadiya, Pratik, Moorthy, Rahul
Forums play an important role in providing a platform for community interaction. The introduction of irrelevant content or spam by individuals for commercial and social gains tends to degrade the professional experience presented to the forum users. Automated moderation of the relevancy of posted content is desired. Machine learning is used for text classification and finds applications in spam email detection, fraudulent transaction detection, etc. Balanced classes in the training data are essential for classification algorithms to learn efficiently and accurately. However, in the case of forums, spam content is sparse compared to relevant content, giving rise to a bias towards the latter during training. A model trained on such biased data will tend to misclassify spam samples. An approach based on the Synthetic Minority Over-sampling Technique (SMOTE) is presented in this paper to tackle imbalanced training data. It involves synthetically creating new minority class samples from the existing ones until balance in data is achieved. The enhanced data is then passed through various classifiers for which the performance is recorded. The results were analyzed on the data of forums of Spoken Tutorial, IIT Bombay over standard performance metrics and revealed that models trained after Synthetic Minority oversampling outperform the ones trained on imbalanced data by substantial margins. An empirical comparison of the results obtained with and without SMOTE for various supervised classification algorithms is presented in this paper. Synthetic oversampling proves to be a critical technique for achieving uniform class distribution, which in turn yields commendable results in text classification. The presented approach can be further extended to content categorization on educational websites, thus helping to improve the overall digital learning experience.
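The core idea of SMOTE-style oversampling can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. This is a simplified illustration with hypothetical names (`smote_like`, `k`, `seed`), not the implementation used in the paper; it assumes numeric feature vectors (for text, these would come from some vectorizer) and at least two minority samples.

```python
import random

def squared_distance(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation.

    minority: list of feature vectors (lists of floats) from the minority class.
    k: number of nearest minority neighbours considered for each base sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical usage: add samples until the minority class matches the majority size.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
majority_size = 8
new_points = smote_like(minority, majority_size - len(minority))
```

Because every synthetic point lies between two existing minority samples, the new data stays inside the region already occupied by the minority class rather than duplicating samples exactly.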
Photometric light curves classification with machine learning
Gabruseva, Tatiana, Zlobin, Sergey, Wang, Peter
The Large Synoptic Survey Telescope will complete its survey in 2022 and produce terabytes of imaging data each night. To work with this massive onset of data, automated algorithms to classify astronomical light curves are crucial. Here, we present a method for automated classification of photometric light curves for a range of astronomical objects. Our approach is based on the gradient boosting of decision trees, feature extraction and selection, and augmentation. The solution was developed in the context of The Photometric LSST Astronomical Time Series Classification Challenge (PLAsTiCC) and achieved one of the top results in the challenge.
Transformer to CNN: Label-scarce distillation for efficient text classification
Chia, Yew Ken, Witteveen, Sam, Andrews, Martin
Significant advances have been made in Natural Language Processing (NLP) modelling since the beginning of 2018. The new approaches allow for accurate results, even when there is little labelled data, because these NLP models can benefit from training on both task-agnostic and task-specific unlabelled data. However, these advantages come with significant size and computational costs. This workshop paper outlines how our proposed convolutional student architecture, having been trained by a distillation process from a large-scale model, can achieve 300× inference speedup and 39× reduction in parameter count. In some cases, the student model performance surpasses its teacher on the studied tasks.
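For readers unfamiliar with distillation, the sketch below shows one common form of the training signal: the student is penalized by the cross-entropy between the teacher's temperature-softened output distribution and its own. This is a generic illustration and not necessarily the loss used in this paper; the function names and the temperature value are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits, temperature=1.0):
    # Log of the temperature-scaled softmax, computed without taking log(0).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = math.log(sum(math.exp(x - m) for x in scaled))
    return [x - m - log_total for x in scaled]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions.

    A higher temperature flattens the teacher distribution, exposing how the
    teacher ranks the non-predicted classes.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))

# Hypothetical logits for a three-class task.
matched = distillation_loss([2.0, 0.0, -1.0], [2.0, 0.0, -1.0])
mismatched = distillation_loss([2.0, 0.0, -1.0], [-1.0, 0.0, 2.0])
```

The loss is smallest when the student reproduces the teacher's distribution, so minimizing it over unlabelled examples lets a small student imitate a large teacher without needing gold labels.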
Introduction to Authorship Analysis as a Text Classification/Clustering Problem
The art and science of discriminating between writing styles of authors by identifying the characteristics of the persona of the authors and examining articles authored by them is called Authorship Analysis. It aims to determine characteristics of an individual, such as age, gender, native language and personality traits, based on "available information" pertaining to that individual. In this article, "available information" refers to textual data only in the context of authorship analysis; however, information in this context could go beyond textual format, as it might also involve multi-modal observations. Multi-modal observations capture characteristic features such as voice, intonation, gestures, body posture and other physical behavioral aspects of an individual. A combination of all these characteristics reflects the persona of an individual and consequently helps in profiling that individual.
Out-of-Domain Detection for Low-Resource Text Classification Tasks
Tan, Ming, Yu, Yang, Wang, Haoyu, Wang, Dakuo, Potdar, Saloni, Chang, Shiyu, Yu, Mo
The goal is to detect the OOD cases with limited in-domain (ID) training data, since we observe that training data is often insufficient in machine learning applications. In this work, we propose an OOD-resistant Prototypical Network to tackle this zero-shot OOD detection and few-shot ID classification task. Evaluation on real-world datasets shows that the proposed solution outperforms state-of-the-art methods on the zero-shot OOD detection task, while maintaining competitive performance on the ID classification task.
1 Introduction
Text classification tasks in real-world applications often consist of two components: In-Domain (ID) classification and Out-of-Domain (OOD) detection (Liao et al., 2018; Kim and Kim, 2018; Shu et al., 2017; Shamekhi et al., 2018). ID classification refers to classifying a user's input with a label that exists in the training data, and OOD detection refers to assigning a special OOD tag to the input when it does not belong to any of the labels in the ID training dataset (Dai et al., 2007). Recent state-of-the-art deep learning (DL) approaches for OOD detection and ID classification often require massive amounts of ID or OOD labeled data (Kim and Kim, 2018).
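The two components described above can be illustrated with a bare-bones prototype classifier: each ID class is represented by the mean of a few support embeddings, an input is assigned to the nearest prototype, and it is tagged OOD when even the nearest prototype is too far away. This is only a sketch of the general prototypical-network idea; the paper's OOD-resistant training is not shown, and the embeddings, the `threshold` parameter, and the `"OOD"` tag are assumptions of this example.

```python
import math

def mean_vector(vectors):
    # Component-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def build_prototypes(support):
    """Map each ID label to the mean embedding of its few-shot support examples."""
    return {label: mean_vector(vectors) for label, vectors in support.items()}

def classify(embedding, prototypes, threshold):
    """Return the nearest ID label, or "OOD" if no prototype is within threshold."""
    label, distance = min(
        ((lbl, math.dist(embedding, proto)) for lbl, proto in prototypes.items()),
        key=lambda pair: pair[1],
    )
    return label if distance <= threshold else "OOD"

# Hypothetical 2-D sentence embeddings for two ID intents.
support = {
    "weather": [[0.0, 0.0], [0.0, 2.0]],
    "music": [[10.0, 10.0], [10.0, 12.0]],
}
prototypes = build_prototypes(support)
in_domain = classify([0.5, 1.0], prototypes, threshold=2.0)
out_of_domain = classify([5.0, 6.0], prototypes, threshold=2.0)
```

In practice the threshold would have to be tuned on held-out data, and the quality of the result depends entirely on how well the embedding space separates ID from OOD inputs.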
Human-grounded Evaluations of Explanation Methods for Text Classification
Lertvittayakumjorn, Piyawat, Toni, Francesca
For text classification in particular, most of the existing explanation methods identify parts of the input text which contribute most towards the predicted class (so-called attribution methods or relevance methods) by exploiting various techniques such as input perturbation (Li et al., 2016), gradient analysis (Dimopoulos et al., 1995), and relevance propagation (Arras et al., 2017b). Besides, there are other explanation methods designed for specific deep learning architectures such as attention mechanism (Ghaeini et al., 2018) and extractive rationale generation (Lei et al., 2016). We select some well-known explanation methods (which are applicable to CNNs for text classification) and evaluate them together with two new explanation methods proposed in this paper.