AITopics | Text Classification

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

Centroid estimation based on symmetric KL divergence for Multinomial text classification problem

Chen, Jiangning, Matzinger, Heinrich, Zhai, Haoyan, Zhou, Mi

arXiv.org Machine Learning, Oct-24-2018

We define a new method to estimate centroid for text classification based on the symmetric KL-divergence between the distribution of words in training documents and their class centroids. Experiments on several standard data sets indicate that the new method achieves substantial improvements over the traditional classifiers.
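As a rough illustration of the quantity this abstract builds on, the following Python sketch classifies a document by the symmetric KL divergence between its word distribution and each class centroid. It does not reproduce the paper's centroid estimator: the centroids here are plain additively smoothed word frequencies, and the function names, toy data, and smoothing constant are all assumptions made for the example.

```python
import math
from collections import Counter


def word_distribution(tokens, vocabulary, smoothing=1.0):
    """Additively smoothed word distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocabulary) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}


def symmetric_kl(p, q):
    """KL(p || q) + KL(q || p), which simplifies to sum((p - q) * log(p / q))."""
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)


def classify(tokens, centroids, vocabulary):
    """Assign the class whose centroid is closest in symmetric KL divergence."""
    doc = word_distribution(tokens, vocabulary)
    return min(centroids, key=lambda label: symmetric_kl(doc, centroids[label]))


# Toy training data (hypothetical, not from the paper).
training = {
    "sports": "ball goal team team win goal".split(),
    "tech": "code bug compile code release bug".split(),
}
vocabulary = sorted({w for tokens in training.values() for w in tokens})

# Baseline centroids: smoothed word frequencies pooled per class (an assumption).
centroids = {label: word_distribution(tokens, vocabulary)
             for label, tokens in training.items()}

predicted = classify("goal team ball".split(), centroids, vocabulary)
print(predicted)
```

Smoothing keeps every probability strictly positive, so the logarithm is always defined.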

classification, machine learning, natural language, (15 more...)

arXiv.org Machine Learning

1808.10261

Country: North America > United States (0.15)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.71)

A Text Classification Application: Poet Detection from Poetry

Sahin, Durmus Ozkan, Kural, Oguz Emre, Kilic, Erdal, Karabina, Armagan

arXiv.org Machine Learning, Oct-24-2018

With the widespread use of the internet, the size of text data increases day by day. Poems can be given as an example of this growing text. In this study, we aim to classify poetry according to poet. First, a data set consisting of the English-language poetry of three different poets was constructed. Then, text categorization techniques were applied to it. The Chi-Square technique is used for feature selection. In addition, five different classification algorithms are tried: sequential minimal optimization, Naive Bayes, the C4.5 decision tree, Random Forest and k-nearest neighbors. Although the classifiers showed very different results, a classification success rate of over 70% was achieved by the sequential minimal optimization technique.
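The Chi-Square feature selection step mentioned in this abstract can be sketched as follows. This is a minimal Python illustration of the standard 2x2 chi-square statistic for a term and a class, not the authors' pipeline; the poet labels, the toy documents, and the function names are hypothetical.

```python
def chi_square(docs, labels, term, target):
    """Chi-square statistic of a term against one class, from a 2x2 table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1  # term present, target class
        elif has_term:
            b += 1  # term present, other class
        elif in_class:
            c += 1  # term absent, target class
        else:
            d += 1  # term absent, other class
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def top_terms(docs, labels, target, k):
    """Return the k terms with the highest chi-square score for a class."""
    vocabulary = sorted(set().union(*docs))
    return sorted(vocabulary,
                  key=lambda t: chi_square(docs, labels, t, target),
                  reverse=True)[:k]


# Toy corpus: each poem is a set of tokens (hypothetical data).
docs = [{"raven", "night"}, {"raven", "dream"},
        {"daffodil", "night"}, {"daffodil", "cloud"}]
labels = ["poet_a", "poet_a", "poet_b", "poet_b"]

raven_score = chi_square(docs, labels, "raven", "poet_a")
night_score = chi_square(docs, labels, "night", "poet_a")
selected = top_terms(docs, labels, "poet_a", 2)
print(raven_score, night_score, selected)
```

The statistic measures association in either direction, so a term that only ever appears in the other poet's work scores as highly as one exclusive to the target poet. Both are useful features for telling the two apart.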

machine learning, natural language, processing and communication application conference, (16 more...)

arXiv.org Machine Learning

1810.11414

Country: Asia > Middle East > Republic of Türkiye (1.00)

Genre: Research Report (0.71)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.57)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.54)

Revisiting Distributional Correspondence Indexing: A Python Reimplementation and New Experiments

Moreo, Alejandro, Esuli, Andrea, Sebastiani, Fabrizio

arXiv.org Machine Learning, Oct-19-2018

This paper introduces PyDCI, a new implementation of Distributional Correspondence Indexing (DCI) written in Python. DCI is a transfer learning method for cross-domain and cross-lingual text classification for which we had provided an implementation (here called JaDCI) built on top of JaTeCS, a Java framework for text classification. PyDCI is a stand-alone version of DCI that exploits scikit-learn and the SciPy stack. We here report on new experiments that we have carried out in order to test PyDCI, and in which we use as baselines new high-performing methods that have appeared after DCI was originally proposed. These experiments show that, thanks to a few subtle ways in which we have improved DCI, PyDCI outperforms both JaDCI and the above-mentioned high-performing methods, and delivers the best known results on the two popular benchmarks on which we had tested DCI, i.e., MultiDomainSentiment (a.k.a. MDS -- for cross-domain adaptation) and Webis-CLS-10 (for cross-lingual adaptation). PyDCI, together with the code allowing to replicate our experiments, is available at https://github.com/AlexMoreo/pydci .

classification, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

1810.09311

Country: Europe (0.68)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Artificial Intelligence for Records Management | RecordPoint

#artificialintelligence, Sep-29-2018, 00:36:38 GMT

As we discussed in the previous article, the Top 3 Challenges of Records Management, records management automation is the best way to address these challenges. But what is automation, really? Within these two main categories there are seven types of automation we typically deal with in the records management world. They can use fingerprinting, linguistic analysis, or both as methods of automation. All of them help us to classify content correctly against the file plan, and in some cases, we can build relationships between content for even better classification. This also helps us to enhance search and retrieval of information.

machine learning, natural language, text classification, (14 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.38)

Explaining Black-Box Machine Learning Models - Code Part 2: Text classification with LIME

#artificialintelligence, Sep-28-2018, 09:17:00 GMT

Okay, our model above works but there are still common words and stop words in our model that LIME picks up on. Ideally, we would want to remove them before modeling and keep only relevant words. This we can accomplish by using additional steps and options in our preprocessing function. Important to know is that whatever preprocessing we do with our text corpus, train and test data have to have the same features (i.e. [...]). If we were to incorporate all the steps shown below into one function and call it separately on train and test data, we would end up with different words in our dtm and the predict() function won't work any more.
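The point about train and test data sharing the same features can be sketched in Python. This is not the tutorial's own code: the helper names, the stop-word list, and the example sentences are assumptions. The vocabulary is fitted once on the training documents and then reused unchanged to build the test matrix.

```python
from collections import Counter


def fit_vocabulary(documents, stop_words=frozenset()):
    """Learn a term-to-column mapping from the training documents only."""
    terms = sorted({token
                    for doc in documents
                    for token in doc.lower().split()
                    if token not in stop_words})
    return {term: index for index, term in enumerate(terms)}


def to_matrix(documents, vocabulary):
    """Build a document-term matrix using a previously fitted vocabulary."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        # Terms outside the vocabulary are dropped, so columns always match.
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows


# Hypothetical example texts.
train_docs = ["the flight was late", "great flight crew"]
test_docs = ["the crew was rude"]
stop_words = frozenset({"the", "was"})

vocabulary = fit_vocabulary(train_docs, stop_words)  # fitted on train only
train_matrix = to_matrix(train_docs, vocabulary)
test_matrix = to_matrix(test_docs, vocabulary)       # same columns as train
print(list(vocabulary), train_matrix, test_matrix)
```

Calling `fit_vocabulary` separately on the test documents would instead produce a different set of columns, which is the failure mode the passage warns about.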

artificial intelligence, natural language, text classification, (4 more...)

#artificialintelligence

Industry: Transportation > Air (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.40)

Counterfactual Fairness in Text Classification through Robustness

Garg, Sahaj, Perot, Vincent, Limtiaco, Nicole, Taly, Ankur, Chi, Ed H., Beutel, Alex

arXiv.org Machine Learning, Sep-27-2018

In this paper, we study counterfactual fairness in text classification, which asks the question: How would the prediction change if the sensitive attribute discussed in the example were something else? We offer a heuristic for measuring this particular form of fairness in text classifiers by substituting individual tokens pertaining to attributes (e.g. sexual orientation, race, and religion), and describe the relationship with other notions, including individual and group fairness. Further, we offer methods, including hard ablation, blindness, and counterfactual logit pairing, for optimizing this counterfactual fairness metric during model training, bridging the robustness literature and the fairness literature. Empirically, counterfactual logit pairing performs as well as hard ablation and blindness to sensitive tokens, but generalizes better to unseen tokens. Interestingly, we find that in practice, the methods do not significantly harm classifier performance, and have varying tradeoffs with group fairness. These approaches, both for measurement and optimization, provide a new path forward for addressing counterfactual fairness issues.
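The token-substitution heuristic described in this abstract can be illustrated with a small Python sketch. It is an assumption-laden toy, not the paper's exact metric: the function name, the placeholder identity tokens, and both stand-in classifiers are hypothetical. The idea is to swap one attribute token for alternatives and measure how far the prediction moves.

```python
def counterfactual_gap(predict, text, token, substitutes):
    """Mean absolute change in prediction when `token` is swapped out."""
    base = predict(text)
    words = text.split()
    gaps = []
    for substitute in substitutes:
        swapped = " ".join(substitute if w == token else w for w in words)
        gaps.append(abs(predict(swapped) - base))
    return sum(gaps) / len(gaps) if gaps else 0.0


def biased_model(text):
    """Toy scorer that reacts to one identity token (hypothetical)."""
    return 0.8 if "identity_b" in text.split() else 0.2


def blind_model(text):
    """Toy scorer that ignores identity tokens entirely (hypothetical)."""
    return 0.2


sentence = "some people are identity_a"
alternatives = ["identity_b", "identity_c"]

biased_gap = counterfactual_gap(biased_model, sentence, "identity_a", alternatives)
blind_gap = counterfactual_gap(blind_model, sentence, "identity_a", alternatives)
print(biased_gap, blind_gap)
```

A gap of zero means the prediction did not change under any of the substitutions tried. It says nothing about tokens that were not in the substitution list.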

machine learning, natural language, text classification, (12 more...)

arXiv.org Machine Learning

1809.1061

Country: North America > United States > California > Santa Clara County (0.28)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.72)

Building a text classification model with TensorFlow Hub and Estimators

#artificialintelligence, Sep-19-2018, 11:43:36 GMT

We often see transfer learning applied to computer vision models, but what about using it for text classification? Enter TensorFlow Hub, a library for enhancing your TF models with transfer learning. Transfer learning is the process of taking the weights and variables of a pre-existing model that has already been trained on lots of data and leveraging it for your own data and prediction task. One of the many benefits of transfer learning is that you don't need to provide as much of your own training data as you would if you were starting from scratch. But where do these pre-existing models come from?

machine learning, natural language, text classification, (7 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.64)

Automatic Judgment Prediction via Legal Reading Comprehension

Long, Shangbang, Tu, Cunchao, Liu, Zhiyuan, Sun, Maosong

arXiv.org Artificial Intelligence, Sep-18-2018

Automatic judgment prediction aims to predict the judicial results based on case materials. It has been studied for several decades mainly by lawyers and judges, considered as a novel and prospective application of artificial intelligence techniques in the legal field. Most existing methods follow the text classification framework, which fails to model the complex interactions among complementary case materials. To address this issue, we formalize the task as Legal Reading Comprehension according to the legal scenario. Following the working protocol of human judges, LRC predicts the final judgment results based on three types of information, including fact description, plaintiffs' pleas, and law articles. Moreover, we propose a novel LRC model, AutoJudge, which captures the complex semantic interactions among facts, pleas, and laws. In experiments, we construct a real-world civil case dataset for LRC. Experimental results on this dataset demonstrate that our model achieves significant improvement over state-of-the-art models. We will publish all source codes and datasets of this work on github.com for further research.

machine learning, natural language, text classification, (19 more...)

arXiv.org Artificial Intelligence

1809.06537

Country:

North America > United States (0.46)
Asia > China (0.29)

Genre: Research Report (1.00)

Industry:

Education > Assessment & Standards > Student Performance (0.64)
Law > Litigation (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.68)

Graph Convolutional Networks for Text Classification

Yao, Liang, Mao, Chengsheng, Luo, Yuan

arXiv.org Artificial Intelligence, Sep-15-2018

Text Classification is an important and classical problem in natural language processing. There have been a number of studies that applied convolutional neural networks (convolution on regular grid, e.g., sequence) to classification. However, only a limited number of studies have explored the more flexible graph convolutional neural networks (e.g., convolution on non-grid, e.g., arbitrary graph) for the task. In this work, we propose to use graph convolutional networks for text classification. We build a single text graph for a corpus based on word co-occurrence and document word relations, then learn a Text Graph Convolutional Network (Text GCN) for the corpus. Our Text GCN is initialized with one-hot representation for word and document, it then jointly learns the embeddings for both words and documents, as supervised by the known class labels for documents. Our experimental results on multiple benchmark datasets demonstrate that a vanilla Text GCN without any external word embeddings or knowledge outperforms state-of-the-art methods for text classification. On the other hand, Text GCN also learns predictive word and document embeddings. In addition, experimental results show that the improvement of Text GCN over state-of-the-art comparison methods become more prominent as we lower the percentage of training data, suggesting the robustness of Text GCN to less training data in text classification.
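The single corpus-level text graph described in this abstract, with document nodes, word nodes, and edges from document-word relations and word co-occurrence, can be sketched in Python as below. Only the graph construction is shown, not the graph convolutional network. The raw-count edge weights, the sliding-window size, and the function name are assumptions, since the abstract does not specify how edges are weighted.

```python
from collections import Counter
from itertools import combinations


def build_text_graph(documents, window=3):
    """Return weighted edges of a graph over document nodes and word nodes."""
    edges = Counter()
    for index, doc in enumerate(documents):
        tokens = doc.lower().split()
        doc_node = ("doc", index)
        # Document-word edges, weighted by term count (an assumption).
        for word, count in Counter(tokens).items():
            edges[(doc_node, ("word", word))] += count
        # Word-word edges, weighted by co-occurrence in a sliding window.
        for start in range(max(1, len(tokens) - window + 1)):
            span = sorted(set(tokens[start:start + window]))
            for first, second in combinations(span, 2):
                edges[(("word", first), ("word", second))] += 1
    return edges


# Hypothetical two-document corpus.
corpus = ["graph networks classify text", "text graph"]
graph = build_text_graph(corpus)

for (source, target), weight in sorted(graph.items()):
    print(source, target, weight)
```

Word pairs are stored in sorted order so that each undirected word-word edge has a single key.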

machine learning, natural language, text classification, (15 more...)

arXiv.org Artificial Intelligence

1809.05679

Country: North America > United States (0.28)

Genre:

Research Report > New Finding (0.88)
Research Report > Experimental Study (0.68)

Industry: Health & Medicine > Therapeutic Area (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Weakly-Supervised Neural Text Classification

Meng, Yu, Shen, Jiaming, Zhang, Chao, Han, Jiawei

arXiv.org Machine Learning, Sep-12-2018

Deep neural networks are gaining increasing popularity for the classic text classification task, due to their strong expressive power and less requirement for feature engineering. Despite such attractiveness, neural text classification models suffer from the lack of training data in many real-world applications. Although many semi-supervised and weakly-supervised text classification models exist, they cannot be easily applied to deep neural models and meanwhile support limited supervision types. In this paper, we propose a weakly-supervised method that addresses the lack of training data in neural text classification. Our method consists of two modules: (1) a pseudo-document generator that leverages seed information to generate pseudo-labeled documents for model pre-training, and (2) a self-training module that bootstraps on real unlabeled data for model refinement. Our method has the flexibility to handle different types of weak supervision and can be easily integrated into existing deep neural models for text classification. We have performed extensive experiments on three real-world datasets from different domains. The results demonstrate that our proposed method achieves inspiring performance without requiring excessive training data and outperforms baseline methods significantly.

machine learning, natural language, text classification, (15 more...)

arXiv.org Machine Learning

doi: 10.1145/3269206.3271737

1809.01478

Country:

Europe > Italy > Piedmont > Turin Province > Turin (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Illinois > Champaign County > Urbana (0.04)
(2 more...)

Genre: Research Report > New Finding (0.66)

Industry:

Leisure & Entertainment > Sports (0.46)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)