AITopics | Text Classification

Collaborating Authors

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

News Overviews Instructional Materials AI-Alerts Classics

Projects In Machine Learning NLP for Text Classification with NLTK & Scikit-learn Eduonix

#artificialintelligenceJul-18-2018, 05:21:28 GMT

In this tutorial, we will cover Natural Language Processing for Text Classification with NLTK & Scikit-learn. Remember the last Natural Language Processing project we did? We will be using all that information to create a Spam filter. This tutorial will also cover Feature Engineering and ensemble NLP in text classification. This project will use Jupiter Notebook running Python 2.7.

artificial intelligence, natural language, text classification, (5 more...)

#artificialintelligence

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.92)

Add feedback

Orthogonal Matching Pursuit for Text Classification

Skianis, Konstantinos, Tziortziotis, Nikolaos, Vazirgiannis, Michalis

arXiv.org Machine LearningJul-12-2018

In text classification, the problem of overfitting arises due to the high dimensionality, making regularization essential. Although classic regularizers provide sparsity, they fail to return highly accurate models. On the contrary, state-of-the-art group-lasso regularizers provide better results at the expense of low sparsity. In this paper, we apply a greedy variable selection algorithm, called Orthogonal Matching Pursuit, for the text classification task. We also extend standard group OMP by introducing overlapping group OMP to handle overlapping groups of features. Empirical analysis verifies that both OMP and overlapping GOMP constitute powerful regularizers, able to produce effective and super-sparse models. Code and data are available here.

machine learning, natural language, text classification, (16 more...)

arXiv.org Machine Learning

1807.04715

Country:

North America > United States > New York > New York County > New York City (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > France (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Salesforce research

#artificialintelligenceJun-21-2018, 12:21:07 GMT

Deep learning has significantly improved state-of-the-art performance for natural language processing tasks like machine translation, summarization, question answering, and text classification. Each of these tasks is typically studied with a specific metric, and performance is often measured on a set of standard benchmark datasets. This has led to the development of architectures designed specifically for those tasks and metrics, but it does not necessarily promote the emergence of general NLP models, those which can perform well across a wide variety of NLP tasks. In order to explore the possibility of such models as well as the tradeoffs that arise in optimizing for them, we introduce the Natural Language Decathlon (decaNLP). The goal of the Decathlon is to explore models that generalize to all ten tasks and investigate how such models differ from those trained for single tasks.

machine learning, natural language, text classification, (19 more...)

#artificialintelligence

Industry: Information Technology > Software (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.38)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)
(3 more...)

Add feedback

Investigating Capsule Networks with Dynamic Routing for Text Classification

Zhao, Wei, Ye, Jianbo, Yang, Min, Lei, Zeyang, Zhang, Suofei, Zhao, Zhou

arXiv.org Artificial IntelligenceJun-20-2018

In this study, we explore capsule networks with dynamic routing for text classification. We propose three strategies to stabilize the dynamic routing process to alleviate the disturbance of some noise capsules which may contain "background" information or have not been successfully trained. A series of experiments are conducted with capsule networks on six text classification benchmarks. Capsule networks achieve competitive results over the strong baseline methods on 4 out of 6 datasets, which shows the effectiveness of capsule networks for text classification. We additionally show that capsule networks exhibit significant improvement when transfer single-label to multi-label text classification over the strong competitors. To the best of our knowledge, this is the first work that capsule networks have been empirically investigated for text modeling.

machine learning, natural language, text classification, (20 more...)

arXiv.org Artificial Intelligence

1804.00538

Country:

Europe > United Kingdom (0.14)
Asia > China > Guangdong Province > Shenzhen (0.04)
North America > United States > Pennsylvania (0.04)
(8 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Banking & Finance > Economy (0.95)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Using J-K fold Cross Validation to Reduce Variance When Tuning NLP Models

Moss, Henry B., Leslie, David S., Rayson, Paul

arXiv.org Machine LearningJun-19-2018

K-fold cross validation (CV) is a popular method for estimating the true performance of machine learning models, allowing model selection and parameter tuning. However, the very process of CV requires random partitioning of the data and so our performance estimates are in fact stochastic, with variability that can be substantial for natural language processing tasks. We demonstrate that these unstable estimates cannot be relied upon for effective parameter tuning. The resulting tuned parameters are highly sensitive to how our data is partitioned, meaning that we often select sub-optimal parameter choices and have serious reproducibility issues. Instead, we propose to use the less variable J-K-fold CV, in which J independent K-fold cross validations are used to assess performance. Our main contributions are extending J-K-fold CV from performance estimation to parameter tuning and investigating how to choose J and K. We argue that variability is more important than bias for effective tuning and so advocate lower choices of K than are typically seen in the NLP literature, instead use the saved computation to increase J. To demonstrate the generality of our recommendations we investigate a wide range of case-studies: sentiment classification (both general and target-specific), part-of-speech tagging and document classification.

machine learning, natural language, variability, (17 more...)

arXiv.org Machine Learning

1806.07139

Country: North America > United States > New York (0.04)

Genre: Research Report (0.51)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.89)
(2 more...)

Add feedback

A Scalable Machine Learning Approach for Inferring Probabilistic US-LI-RADS Categorization

Banerjee, Imon, Choi, Hailey H., Desser, Terry, Rubin, Daniel L.

arXiv.org Artificial IntelligenceJun-15-2018

We propose a scalable computerized approach for large-scale inference of Liver Imaging Reporting and Data System (LI-RADS) final assessment categories in narrative ultrasound (US) reports. Although our model was trained on reports created using a LI-RADS template, it was also able to infer LI-RADS scoring for unstructured reports that were created before the LI-RADS guidelines were established. No human-labelled data was required in any step of this study; for training, LI-RADS scores were automatically extracted from those reports that contained structured LI-RADS scores, and it translated the derived knowledge to reasoning on unstructured radiology reports. By providing automated LI-RADS categorization, our approach may enable standardizing screening recommendations and treatment planning of patients at risk for hepatocellular carcinoma, and it may facilitate AI-based healthcare research with US images by offering large scale text mining and data gathering opportunities from standard hospital clinical data repositories.

machine learning, natural language, text classification, (18 more...)

arXiv.org Artificial Intelligence

1806.07346

Country:

North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > District of Columbia > Washington (0.04)
Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)

Genre: Research Report > New Finding (0.68)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Hepatology (1.00)
Health & Medicine > Nuclear Medicine (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

Text Classification based on Word Subspace with Term-Frequency

Shimomoto, Erica K., Souza, Lincon S., Gatto, Bernardo B., Fukui, Kazuhiro

arXiv.org Machine LearningJun-8-2018

Text classification has become indispensable due to the rapid increase of text in digital form. Over the past three decades, efforts have been made to approach this task using various learning algorithms and statistical models based on bag-of-words (BOW) features. Despite its simple implementation, BOW features lack semantic meaning representation. To solve this problem, neural networks started to be employed to learn word vectors, such as the word2vec. Word2vec embeds word semantic structure into vectors, where the angle between vectors indicates the meaningful similarity between words. To measure the similarity between texts, we propose the novel concept of word subspace, which can represent the intrinsic variability of features in a set of word vectors. Through this concept, it is possible to model text from word vectors while holding semantic information. To incorporate the word frequency directly in the subspace model, we further extend the word subspace to the term-frequency (TF) weighted word subspace. Based on these new concepts, text classification can be performed under the mutual subspace method (MSM) framework. The validity of our modeling is shown through experiments on the Reuters text database, comparing the results to various state-of-art algorithms.

machine learning, natural language, text classification, (19 more...)

arXiv.org Machine Learning

1806.03125

Country:

Asia > Japan > Honshū > Kantō > Ibaraki Prefecture > Tsukuba (0.05)
North America > United States > Wisconsin > Dane County > Madison (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.91)

Add feedback

Semi-supervised and Transfer learning approaches for low resource sentiment classification

Gupta, Rahul, Sahu, Saurabh, Espy-Wilson, Carol, Narayanan, Shrikanth

arXiv.org Machine LearningJun-7-2018

Sentiment classification involves quantifying the affective reaction of a human to a document, media item or an event. Although researchers have investigated several methods to reliably infer sentiment from lexical, speech and body language cues, training a model with a small set of labeled datasets is still a challenge. For instance, in expanding sentiment analysis to new languages and cultures, it may not always be possible to obtain comprehensive labeled datasets. In this paper, we investigate the application of semi-supervised and transfer learning methods to improve performances on low resource sentiment classification tasks. We experiment with extracting dense feature representations, pre-training and manifold regularization in enhancing the performance of sentiment classification systems. Our goal is a coherent implementation of these methods and we evaluate the gains achieved by these methods in matched setting involving training and testing on a single corpus setting as well as two cross corpora settings. In both the cases, our experiments demonstrate that the proposed methods can significantly enhance the model performance against a purely supervised approach, particularly in cases involving a handful of training data.

machine learning, natural language, text classification, (18 more...)

arXiv.org Machine Learning

1806.02863

Country: North America > United States > California (0.46)

Genre: Research Report (0.50)

Industry: Media (0.31)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

On the Importance of Attention in Meta-Learning for Few-Shot Text Classification

Jiang, Xiang, Havaei, Mohammad, Chartrand, Gabriel, Chouaib, Hassan, Vincent, Thomas, Jesson, Andrew, Chapados, Nicolas, Matwin, Stan

arXiv.org Machine LearningJun-3-2018

Current deep learning based text classification methods are limited by their ability to achieve fast learning and generalization when the data is scarce. We address this problem by integrating a meta-learning procedure that uses the knowledge learned across many tasks as an inductive bias towards better natural language understanding. Based on the Model-Agnostic Meta-Learning framework (MAML), we introduce the Attentive Task-Agnostic Meta-Learning (ATAML) algorithm for text classification. The essential difference between MAML and ATAML is in the separation of task-agnostic representation learning and task-specific attentive adaptation. The proposed ATAML is designed to encourage task-agnostic representation learning by way of task-agnostic parameterization and facilitate task-specific adaptation via attention mechanisms. We provide evidence to show that the attention mechanism in ATAML has a synergistic effect on learning performance. In comparisons with models trained from random initialization, pretrained models and meta trained MAML, our proposed ATAML method generalizes better on single-label and multi-label classification tasks in miniRCV1 and miniReuters-21578 datasets.

machine learning, natural language, text classification, (16 more...)

arXiv.org Machine Learning

1806.00852

Country:

North America > United States (0.69)
Asia > Middle East > Syria (0.15)

Genre: Research Report > New Finding (0.93)

Industry:

Government (1.00)
Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.93)

Add feedback

Intentional Control of Type I Error over Unconscious Data Distortion: a Neyman-Pearson Approach to Text Classification

Xia, Lucy, Zhao, Richard, Wu, Yanhui, Tong, Xin

arXiv.org Machine LearningJun-3-2018

Digital texts have become an increasingly important source of data for social studies. However, textual data from open platforms are vulnerable to manipulation (e.g., censorship and information inflation), often leading to bias in subsequent empirical analysis. This paper investigates the problem of data distortion in text classification when controlling type I error (a relevant textual message is classified as irrelevant) is the priority. The default classical classification paradigm that minimizes the overall classification error can yield an undesirably large type I error, and data distortion exacerbates this situation. As a solution, we propose the Neyman-Pearson (NP) classification paradigm which minimizes type II error under a user-specified type I error constraint. Theoretically, we show that while the classical oracle (i.e., optimal classifier) cannot be recovered under unknown data distortion even if one has the entire post-distortion population, the NP oracle is unaffected by data distortion and can be recovered under the same condition. Empirically, we illustrate the advantage of NP classification methods in a case study that classifies posts about strikes and corruption published on a leading Chinese blogging platform.

machine learning, natural language, text classification, (20 more...)

arXiv.org Machine Learning

1802.02558

Country:

Asia > China (1.00)
North America > United States > California (0.28)

Genre: Research Report (1.00)

Industry:

Law > Civil Rights & Constitutional Law (0.68)
Media > News (0.66)
Government > Regional Government (0.46)
Law > Criminal Law (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.70)
(2 more...)

Add feedback