AITopics | Text Classification

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

Weakly Supervised Attention Networks for Fine-Grained Opinion Mining and Public Health

Karamanolakis, Giannis, Hsu, Daniel, Gravano, Luis

arXiv.org Machine Learning, Sep-30-2019

In many review classification applications, a fine-grained analysis of the reviews is desirable, because different segments (e.g., sentences) of a review may focus on different aspects of the entity in question. However, training supervised models for segment-level classification requires segment labels, which may be more difficult or expensive to obtain than review labels. In this paper, we employ Multiple Instance Learning (MIL) and use only weak supervision in the form of a single label per review. First, we show that when inappropriate MIL aggregation functions are used, then MIL-based networks are outperformed by simpler baselines. Second, we propose a new aggregation function based on the sigmoid attention mechanism and show that our proposed model outperforms the state-of-the-art models for segment-level sentiment classification (by up to 9.8% in F1). Finally, we highlight the importance of fine-grained predictions in an important public-health application: finding actionable reports of foodborne illness. We show that our model achieves 48.6% higher recall compared to previous models, thus increasing the chance of identifying previously unknown foodborne outbreaks.
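To make the aggregation idea concrete, here is a minimal sketch of one way a sigmoid-attention aggregator could combine segment-level predictions into a single review-level prediction. The abstract does not give the exact formula, so the normalisation by the sum of weights and the names `aggregate_review`, `segment_probs` and `attention_scores` are assumptions made for illustration, not the authors' implementation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def aggregate_review(segment_probs, attention_scores):
    """Combine per-segment class probabilities into one review-level distribution.

    Each segment gets an independent sigmoid weight, so several segments can
    be marked as important at once (a softmax would force them to compete).
    Dividing by the sum of weights is an assumption of this sketch.
    """
    weights = [sigmoid(s) for s in attention_scores]
    total = sum(weights)  # always > 0 because every sigmoid weight is positive
    num_classes = len(segment_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, segment_probs)) / total
        for c in range(num_classes)
    ]


# Two segments, two classes; the first segment receives a much higher score.
review = aggregate_review([[0.9, 0.1], [0.2, 0.8]], [4.0, -4.0])
```

Only the review-level output would be compared with the single review label during training; the per-segment probabilities are what a segment-level evaluation would inspect afterwards.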

aggregation function, classification, sentiment classification, (15 more...)

1910.00054

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > Nevada (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report > Experimental Study (0.46)
Research Report > Promising Solution (0.34)

Industry:

Health & Medicine > Public Health (0.61)
Health & Medicine > Epidemiology (0.53)
Health & Medicine > Therapeutic Area (0.47)
Food & Agriculture > Food Processing (0.37)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(4 more...)

r/MachineLearning - [R] Enriching BERT with Knowledge Graph Embeddings for Document Classification

#artificialintelligence, Sep-19-2019, 15:52:55 GMT

In this paper, we focus on the classification of books using short descriptive texts (cover blurbs) and additional metadata. Building upon BERT, a deep neural language model, we demonstrate how to combine text representations with metadata and knowledge graph embeddings, which encode author information. Compared to the standard BERT approach, we achieve considerably better results for the classification task. For a more coarse-grained classification using eight labels we achieve an F1-score of 87.20, while a detailed classification using 343 labels yields an F1-score of 64.70. We make the source code and trained models of our experiments publicly available.

document classification, knowledge graph embedding, machinelearning, (1 more...)

Industry: Media > News (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.48)

Dual Adversarial Co-Learning for Multi-Domain Text Classification

Wu, Yuan, Guo, Yuhong

arXiv.org Machine Learning, Sep-18-2019

With the advent of deep learning, the performance of text classification models has improved significantly. Nevertheless, the successful training of a good classification model requires a sufficient amount of labeled data, while it is always expensive and time-consuming to annotate data. With the rapid growth of digital data, similar classification tasks can typically occur in multiple domains, while the availability of labeled data can vary widely across domains. Some domains may have abundant labeled data, while in other domains there may exist only a limited amount (or none) of labeled data. Meanwhile, text classification tasks are highly domain-dependent: a text classifier trained in one domain may not perform well in another domain. In order to address these issues, in this paper we propose a novel dual adversarial co-learning approach for multi-domain text classification (MDTC). The approach learns shared-private networks for feature extraction and deploys dual adversarial regularizations to align features across different domains and between labeled and unlabeled data simultaneously under a discrepancy-based co-learning framework, aiming to improve the classifiers' generalization capacity with the learned features. We conduct experiments on multi-domain sentiment classification datasets. The results show the proposed approach achieves state-of-the-art MDTC performance.

classifier, machine learning, natural language, (15 more...)

1909.08203

Country: North America (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Authorship Analysis as a Text Classification or Clustering Problem

#artificialintelligence, Sep-14-2019, 17:09:16 GMT

Many such 'literary' quandaries are inspected by expert linguists, as analysing and categorising discourses is fairly complex, domain-specific and highly multi-dimensional. One of the latest research areas in Natural Language Processing is Authorship Analysis, which tries to leverage the computational power of big data and artificial intelligence, combined with linguistics and cognitive psychology, to automate the classification of texts, the identification of author profiles and the resolution of authorship conflicts. This article is an attempt to introduce the concept of authorship analysis, its application areas and the major sub-tasks associated with it. The art and science of discriminating between writing styles of authors by identifying the characteristics of the persona of the authors and examining articles authored by them is called Authorship Analysis. Consequently, it also aims to determine biographic characteristics of an individual like age, gender, native language and cognitive psychological traits based on "available information" pertaining to that individual. In this article, "available information" refers to textual data only in the context of authorship analysis; however, information in this context could go beyond textual format, as it might also involve usage of multi-modal observations.

machine learning, natural language, text classification, (16 more...)

Industry: Information Technology > Security & Privacy (0.31)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.42)

Spam filtering on forums: A synthetic oversampling based approach for imbalanced data classification

Ratadiya, Pratik, Moorthy, Rahul

arXiv.org Machine Learning, Sep-10-2019

Forums play an important role in providing a platform for community interaction. The introduction of irrelevant content or spam by individuals for commercial and social gains tends to degrade the professional experience presented to the forum users. Automated moderation of the relevancy of posted content is desired. Machine learning is used for text classification and finds applications in spam email detection, fraudulent transaction detection, etc. The balance of classes in training data is essential in the case of classification algorithms to make the learning efficient and accurate. However, in the case of forums, the spam content is sparse compared to the relevant content, giving rise to a bias towards the latter while training. A model trained on such biased data will fail to classify a spam sample. An approach based on the Synthetic Minority Over-sampling Technique (SMOTE) is presented in this paper to tackle imbalanced training data. It involves synthetically creating new minority class samples from the existing ones until balance in the data is achieved. The enhanced data is then passed through various classifiers for which the performance is recorded. The results were analyzed on the data of forums of Spoken Tutorial, IIT Bombay over standard performance metrics and revealed that models trained after synthetic minority oversampling outperform the ones trained on imbalanced data by substantial margins. An empirical comparison of the results obtained with and without SMOTE for various supervised classification algorithms is presented in this paper. Synthetic oversampling proves to be a critical technique for achieving uniform class distribution, which in turn yields commendable results in text classification. The presented approach can be further extended to content categorization on educational websites, thus helping to improve the overall digital learning experience.
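The oversampling step described above can be sketched as follows. This is a simplified, SMOTE-style interpolation written with the standard library only; it assumes the posts have already been turned into numeric feature vectors (the abstract does not say how), and the function name `smote_like_oversample` and its parameters are hypothetical rather than taken from the paper or from any library.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating neighbours.

    For each new sample: pick a minority point, pick one of its k nearest
    minority neighbours, and return a random point on the segment joining
    the two. Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, ordered by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Three "spam" vectors in a toy 2-D feature space, plus five synthetic ones.
spam = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
extra_spam = smote_like_oversample(spam, n_new=5)
```

In practice the synthetic samples would be added to the training split only, so that evaluation still reflects the original class imbalance.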

machine learning, natural language, text classification, (17 more...)

1909.04826

Country: Asia > India (0.15)

Genre:

Research Report (0.51)
Instructional Material (0.47)

Industry:

Education (0.87)
Law Enforcement & Public Safety > Fraud (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.98)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.69)

Photometric light curves classification with machine learning

Gabruseva, Tatiana, Zlobin, Sergey, Wang, Peter

arXiv.org Machine Learning, Sep-9-2019

The Large Synoptic Survey Telescope will complete its survey in 2022 and produce terabytes of imaging data each night. To work with this massive onset of data, automated algorithms to classify astronomical light curves are crucial. Here, we present a method for automated classification of photometric light curves for a range of astronomical objects. Our approach is based on the gradient boosting of decision trees, feature extraction and selection, and augmentation. The solution was developed in the context of The Photometric LSST Astronomical Time Series Classification Challenge (PLAsTiCC) and achieved one of the top results in the challenge.

classification, machine learning, natural language, (18 more...)

1909.05032

Country: Europe (0.68)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (0.35)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.34)

Transformer to CNN: Label-scarce distillation for efficient text classification

Chia, Yew Ken, Witteveen, Sam, Andrews, Martin

arXiv.org Machine Learning, Sep-8-2019

Significant advances have been made in Natural Language Processing (NLP) modelling since the beginning of 2018. The new approaches allow for accurate results, even when there is little labelled data, because these NLP models can benefit from training on both task-agnostic and task-specific unlabelled data. However, these advantages come with significant size and computational costs. This workshop paper outlines how our proposed convolutional student architecture, having been trained by a distillation process from a large-scale model, can achieve 300× inference speedup and 39× reduction in parameter count. In some cases, the student model performance surpasses its teacher on the studied tasks.
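As a rough illustration of what a distillation objective can look like, the sketch below computes a generic temperature-softened cross-entropy between teacher and student outputs. The abstract does not state which loss the authors use, so this formulation, the temperature value and the names `softmax` and `distillation_loss` are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's.

    A higher temperature flattens both distributions, so the student also
    learns from the teacher's relative scores for the non-top classes.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


# The loss is lowest when the student reproduces the teacher's logits.
matching = distillation_loss([2.0, 0.0, -1.0], [2.0, 0.0, -1.0])
mismatched = distillation_loss([-1.0, 0.0, 2.0], [2.0, 0.0, -1.0])
```

Because the teacher's outputs serve as targets, a loss of this kind can be computed on unlabelled text, which fits the label-scarce setting the paper describes.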

classification, machine learning, natural language, (16 more...)

1909.03508

Country: North America > Canada (0.14)

Genre: Research Report (0.83)

Industry: Education (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.52)

Introduction to Authorship Analysis as a Text Classification/Clustering Problem

#artificialintelligence, Sep-2-2019, 20:19:23 GMT

The art and science of discriminating between writing styles of authors by identifying the characteristics of the persona of the authors and examining articles authored by them is called Authorship Analysis. It aims to determine characteristics of an individual like age, gender, native language and personality traits based on "available information" pertaining to that individual. In this article, "available information" refers to textual data only in the context of authorship analysis; however, information in this context could go beyond textual format, as it might also involve usage of multi-modal observations. Multi-modal observations capture characteristic features such as voice, intonation, gestures, body posture and other physical behavioral aspects of an individual. A combination of all these characteristics reflects the persona of an individual and consequently helps in profiling that individual.

artificial intelligence, machine learning, natural language, (12 more...)

Industry: Information Technology > Security & Privacy (0.32)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.45)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.43)

Out-of-Domain Detection for Low-Resource Text Classification Tasks

Tan, Ming, Yu, Yang, Wang, Haoyu, Wang, Dakuo, Potdar, Saloni, Chang, Shiyu, Yu, Mo

arXiv.org Machine Learning, Aug-31-2019

The goal is to detect the OOD cases with limited in-domain (ID) training data, since we observe that training data is often insufficient in machine learning applications. In this work, we propose an OOD-resistant Prototypical Network to tackle this zero-shot OOD detection and few-shot ID classification task. Evaluation on real-world datasets shows that the proposed solution outperforms state-of-the-art methods on the zero-shot OOD detection task, while maintaining competitive performance on the ID classification task.

1 Introduction

Text classification tasks in real-world applications often consist of two components: In-Domain (ID) classification and Out-of-Domain (OOD) detection (Liao et al., 2018; Kim and Kim, 2018; Shu et al., 2017; Shamekhi et al., 2018). ID classification refers to classifying a user's input with a label that exists in the training data, and OOD detection refers to assigning a special OOD tag to the input when it does not belong to any of the labels in the ID training dataset (Dai et al., 2007). Recent state-of-the-art deep learning (DL) approaches for OOD detection and ID classification tasks often require massive amounts of ID or OOD labeled data (Kim and Kim, 2018).
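The split between ID classification and OOD detection can be illustrated with a plain nearest-prototype classifier that rejects inputs lying far from every class. This is only a sketch of the general prototypical idea: the fixed distance threshold, the toy 2-D embeddings and the names `build_prototypes` and `classify_or_reject` are assumptions, and the OOD-resistant training the paper proposes is not reproduced here.

```python
import math


def build_prototypes(support):
    """Average the few labelled embeddings of each ID class into one prototype.

    `support` maps a label to a list of equal-length embedding vectors.
    """
    prototypes = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        prototypes[label] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return prototypes


def classify_or_reject(query, prototypes, max_distance):
    """Return the nearest ID label, or "OOD" if no prototype is close enough."""
    best_label, best_dist = None, float("inf")
    for label, proto in prototypes.items():
        dist = math.dist(query, proto)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= max_distance else "OOD"


# Two ID classes with two labelled examples each (few-shot setting).
prototypes = build_prototypes({
    "billing": [[0.0, 0.0], [0.0, 2.0]],
    "shipping": [[10.0, 10.0], [10.0, 12.0]],
})
near = classify_or_reject([0.5, 1.0], prototypes, max_distance=3.0)
far = classify_or_reject([50.0, 50.0], prototypes, max_distance=3.0)
```

No OOD examples are needed to build the prototypes, which is what makes the detection "zero-shot" in this simplified picture; the choice of threshold is the part a real system would have to tune or learn.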

machine learning, natural language, text classification, (18 more...)

1909.05357

Country: Europe (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.52)

Human-grounded Evaluations of Explanation Methods for Text Classification

Lertvittayakumjorn, Piyawat, Toni, Francesca

arXiv.org Artificial Intelligence, Aug-29-2019

For text classification in particular, most of the existing explanation methods identify parts of the input text which contribute most towards the predicted class (so-called attribution methods or relevance methods) by exploiting various techniques such as input perturbation (Li et al., 2016), gradient analysis (Dimopoulos et al., 1995), and relevance propagation (Arras et al., 2017b). Besides, there are other explanation methods designed for specific deep learning architectures, such as attention mechanisms (Ghaeini et al., 2018) and extractive rationale generation (Lei et al., 2016). We select some well-known explanation methods (which are applicable to CNNs for text classification) and evaluate them together with two new explanation methods proposed in this paper.

machine learning, natural language, text classification, (18 more...)

1908.11355

Country: Europe (0.28)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.92)
