 Text Classification


A Text Classification Application: Poet Detection from Poetry

arXiv.org Machine Learning

With the widespread use of the internet, the volume of text data increases day by day. Poems are one example of this growing body of text. In this study, we aim to classify poems according to their poet. First, a data set consisting of English-language poems by three different poets was constructed. Then, text categorization techniques were applied to it. The Chi-Square technique was used for feature selection. In addition, five different classification algorithms were tried: sequential minimal optimization, Naive Bayes, the C4.5 decision tree, Random Forest, and k-nearest neighbors. Although the classifiers produced very different results, the sequential minimal optimization technique achieved a classification success rate of over 70%.
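The Chi-Square feature selection step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy documents, and the labels `poet_a` and `poet_b` are all hypothetical, and the statistic is the standard chi-square test on a 2x2 term/class contingency table.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: documents of the class that contain the term
    n10: documents of other classes that contain the term
    n01: documents of the class that lack the term
    n00: documents of other classes that lack the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    # A term that occurs in every document (or in none) carries no signal.
    return numerator / denominator if denominator else 0.0


def term_scores(docs, labels, target):
    """Score every term by its chi-square association with `target`."""
    tokenized = [set(doc.split()) for doc in docs]
    vocab = set().union(*tokenized)
    scores = {}
    for term in vocab:
        n11 = n10 = n01 = n00 = 0
        for tokens, label in zip(tokenized, labels):
            present = term in tokens
            in_class = label == target
            if present and in_class:
                n11 += 1
            elif present:
                n10 += 1
            elif in_class:
                n01 += 1
            else:
                n00 += 1
        scores[term] = chi_square(n11, n10, n01, n00)
    return scores


# Hypothetical toy corpus: two short "poems" per (made-up) poet.
docs = ["the raven nevermore", "the raven midnight",
        "the daffodils cloud", "the daffodils lonely"]
labels = ["poet_a", "poet_a", "poet_b", "poet_b"]
scores = term_scores(docs, labels, target="poet_a")
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]
```

Terms that appear in every document, such as "the" in the toy corpus, score zero and would be dropped, while terms concentrated in one poet's documents rank highest.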


Revisiting Distributional Correspondence Indexing: A Python Reimplementation and New Experiments

arXiv.org Machine Learning

This paper introduces PyDCI, a new implementation of Distributional Correspondence Indexing (DCI) written in Python. DCI is a transfer learning method for cross-domain and cross-lingual text classification for which we had provided an implementation (here called JaDCI) built on top of JaTeCS, a Java framework for text classification. PyDCI is a stand-alone version of DCI that exploits scikit-learn and the SciPy stack. We here report on new experiments that we have carried out in order to test PyDCI, and in which we use as baselines new high-performing methods that have appeared after DCI was originally proposed. These experiments show that, thanks to a few subtle ways in which we have improved DCI, PyDCI outperforms both JaDCI and the above-mentioned high-performing methods, and delivers the best known results on the two popular benchmarks on which we had tested DCI, i.e., MultiDomainSentiment (a.k.a. MDS, for cross-domain adaptation) and Webis-CLS-10 (for cross-lingual adaptation). PyDCI, together with the code needed to replicate our experiments, is available at https://github.com/AlexMoreo/pydci.


Artificial Intelligence for Records Management RecordPoint

#artificialintelligence

As we discussed in the previous article, the Top 3 Challenges of Records Management, records management automation is the best way to address these challenges. But what is automation, really? Within these two main categories there are seven types of automation we typically deal with in the records management world. They can use fingerprinting, linguistic analysis, or both as methods of automation. All of them help us to classify content correctly against the file plan, and in some cases, we can build relationships between content for even better classification. This also helps us to enhance search and retrieval of information.


Explaining Black-Box Machine Learning Models - Code Part 2: Text classification with LIME

#artificialintelligence

Okay, our model above works, but it still contains common words and stop words that LIME picks up on. Ideally, we would want to remove them before modeling and keep only relevant words. We can accomplish this with additional steps and options in our preprocessing function. It is important to know that whatever preprocessing we do with our text corpus, the train and test data must have the same features. If we were to incorporate all the steps shown below into one function and call it separately on train and test data, we would end up with different words in our dtm and the predict() function would no longer work.
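The requirement that train and test data share the same features can be sketched in plain Python. This is only an illustration under stated assumptions: the original tutorial works with its own preprocessing function and dtm, whereas the helper names, the stop-word list, and the example sentences below are hypothetical. The key idea is that the vocabulary is built once from the training texts and then reused for the test texts.

```python
from collections import Counter

# Illustrative stop-word list (an assumption, not the tutorial's list).
STOP_WORDS = {"the", "a", "is", "and", "of"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def fit_vocabulary(train_texts):
    """Build the term -> column index mapping from the training texts only."""
    terms = sorted({tok for text in train_texts for tok in tokenize(text)})
    return {term: index for index, term in enumerate(terms)}


def to_dtm(texts, vocab):
    """Document-term matrix whose columns follow `vocab`; unseen words are ignored."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts[term] for term in vocab])
    return rows


train_texts = ["the service is great", "a terrible service"]
test_texts = ["great food and great staff"]

vocab = fit_vocabulary(train_texts)      # fitted on train only
train_dtm = to_dtm(train_texts, vocab)
test_dtm = to_dtm(test_texts, vocab)     # same columns as train_dtm
```

Because `test_dtm` is built from the training vocabulary, words that occur only in the test text ("food", "staff") are dropped and both matrices have identical columns, which is what a prediction function expects.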


Counterfactual Fairness in Text Classification through Robustness

arXiv.org Machine Learning

In this paper, we study counterfactual fairness in text classification, which asks the question: How would the prediction change if the sensitive attribute discussed in the example were something else? We offer a heuristic for measuring this particular form of fairness in text classifiers by substituting individual tokens pertaining to attributes (e.g. sexual orientation, race, and religion), and describe the relationship with other notions, including individual and group fairness. Further, we offer methods, including hard ablation, blindness, and counterfactual logit pairing, for optimizing this counterfactual fairness metric during model training, bridging the robustness literature and the fairness literature. Empirically, counterfactual logit pairing performs as well as hard ablation and blindness to sensitive tokens, but generalizes better to unseen tokens. Interestingly, we find that in practice, the methods do not significantly harm classifier performance, and have varying tradeoffs with group fairness. These approaches, both for measurement and optimization, provide a new path forward for addressing counterfactual fairness issues.
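The token-substitution heuristic described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function `counterfactual_gap`, the keyword-weight scorer, and the placeholder tokens `groupA` and `groupB` are all hypothetical. The sketch replaces one attribute token with alternatives and reports the largest change in the model's score.

```python
def counterfactual_gap(predict, text, token, substitutes):
    """Largest absolute change in predict(text) when `token` is swapped out."""
    base = predict(text)
    words = text.split()
    gaps = []
    for substitute in substitutes:
        swapped = " ".join(substitute if w == token else w for w in words)
        gaps.append(abs(predict(swapped) - base))
    return max(gaps) if gaps else 0.0


def make_keyword_model(weights):
    """Toy scorer (an assumption): sums per-word weights, unknown words count 0."""
    def predict(text):
        return sum(weights.get(word, 0.0) for word in text.split())
    return predict


# A scorer that (undesirably) reacts to an attribute token ...
biased_model = make_keyword_model({"awful": 0.5, "groupA": 0.25})
# ... and one that ignores attribute tokens entirely.
blind_model = make_keyword_model({"awful": 0.5})

text = "I am a groupA person"
biased_gap = counterfactual_gap(biased_model, text, "groupA", ["groupB"])
blind_gap = counterfactual_gap(blind_model, text, "groupA", ["groupB"])
```

A gap of zero means the prediction is unchanged under the substitution; the scorer that ignores attribute tokens achieves this trivially, which corresponds loosely to the "blindness" approach named in the abstract.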


Building a text classification model with TensorFlow Hub and Estimators

#artificialintelligence

We often see transfer learning applied to computer vision models, but what about using it for text classification? Enter TensorFlow Hub, a library for enhancing your TF models with transfer learning. Transfer learning is the process of taking the weights and variables of a pre-existing model that has already been trained on lots of data and leveraging it for your own data and prediction task. One of the many benefits of transfer learning is that you don't need to provide as much of your own training data as you would if you were starting from scratch. But where do these pre-existing models come from?


Automatic Judgment Prediction via Legal Reading Comprehension

arXiv.org Artificial Intelligence

Automatic judgment prediction aims to predict the judicial results based on case materials. It has been studied for several decades mainly by lawyers and judges, considered as a novel and prospective application of artificial intelligence techniques in the legal field. Most existing methods follow the text classification framework, which fails to model the complex interactions among complementary case materials. To address this issue, we formalize the task as Legal Reading Comprehension (LRC) according to the legal scenario. Following the working protocol of human judges, LRC predicts the final judgment results based on three types of information: fact description, plaintiffs' pleas, and law articles. Moreover, we propose a novel LRC model, AutoJudge, which captures the complex semantic interactions among facts, pleas, and laws. In experiments, we construct a real-world civil case dataset for LRC. Experimental results on this dataset demonstrate that our model achieves significant improvement over state-of-the-art models. We will publish all source code and datasets of this work on github.com for further research.


Graph Convolutional Networks for Text Classification

arXiv.org Artificial Intelligence

Text classification is an important and classical problem in natural language processing. There have been a number of studies that applied convolutional neural networks (convolution on a regular grid, e.g., a sequence) to classification. However, only a limited number of studies have explored the more flexible graph convolutional neural networks (convolution on a non-grid structure, e.g., an arbitrary graph) for the task. In this work, we propose to use graph convolutional networks for text classification. We build a single text graph for a corpus based on word co-occurrence and document-word relations, then learn a Text Graph Convolutional Network (Text GCN) for the corpus. Our Text GCN is initialized with one-hot representations for words and documents; it then jointly learns the embeddings for both words and documents, as supervised by the known class labels for documents. Our experimental results on multiple benchmark datasets demonstrate that a vanilla Text GCN without any external word embeddings or knowledge outperforms state-of-the-art methods for text classification. On the other hand, Text GCN also learns predictive word and document embeddings. In addition, experimental results show that the improvement of Text GCN over state-of-the-art comparison methods becomes more prominent as we lower the percentage of training data, suggesting the robustness of Text GCN to less training data in text classification.
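The single corpus-level graph described in this abstract, with document-word and word-word edges, can be sketched as below. This is a rough illustration under stated assumptions rather than the paper's construction: the function name, the node naming scheme, the sliding-window size, and the use of raw counts as edge weights are all choices made here for simplicity, and the abstract does not specify the weighting.

```python
from collections import Counter
from itertools import combinations


def build_text_graph(docs, window=3):
    """Return weighted edges of a graph whose nodes are documents and words.

    Assumed weights (for illustration only):
      doc-word edge  -> how often the word occurs in the document
      word-word edge -> number of sliding windows containing both words
    """
    edges = Counter()
    for index, doc in enumerate(docs):
        tokens = doc.split()
        # Document-word edges.
        for token, count in Counter(tokens).items():
            edges[(f"doc:{index}", f"word:{token}")] = count
        # Word-word co-occurrence edges; short documents form one window.
        for start in range(max(1, len(tokens) - window + 1)):
            span = sorted(set(tokens[start:start + window]))
            for first, second in combinations(span, 2):
                edges[(f"word:{first}", f"word:{second}")] += 1
    return edges


docs = ["graph networks classify text", "text graph"]
edges = build_text_graph(docs, window=3)
nodes = {node for edge in edges for node in edge}
```

Each word pair is stored once in sorted order, so the edge dictionary describes an undirected graph; a graph convolutional network would consume the corresponding adjacency matrix over all document and word nodes.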


Weakly-Supervised Neural Text Classification

arXiv.org Machine Learning

Deep neural networks are gaining increasing popularity for the classic text classification task, due to their strong expressive power and reduced need for feature engineering. Despite such attractiveness, neural text classification models suffer from the lack of training data in many real-world applications. Although many semi-supervised and weakly-supervised text classification models exist, they cannot be easily applied to deep neural models, and they support only limited types of supervision. In this paper, we propose a weakly-supervised method that addresses the lack of training data in neural text classification. Our method consists of two modules: (1) a pseudo-document generator that leverages seed information to generate pseudo-labeled documents for model pre-training, and (2) a self-training module that bootstraps on real unlabeled data for model refinement. Our method has the flexibility to handle different types of weak supervision and can be easily integrated into existing deep neural models for text classification. We have performed extensive experiments on three real-world datasets from different domains. The results demonstrate that our proposed method achieves inspiring performance without requiring excessive training data and outperforms baseline methods significantly.


Machine Learning for Text Classification Using SpaCy in Python

#artificialintelligence

It provides current state-of-the-art accuracy and speed levels, and has an active open source community. However, since spaCy is a relatively new NLP library, it is not as widely adopted as NLTK, and there are not yet sufficient tutorials available. In this post, we will demonstrate how text classification can be implemented using spaCy without having any deep learning experience. It is often a time-consuming and frustrating experience for a young researcher to find and select a suitable academic conference to submit his (or her) academic papers to. We define a "suitable conference" as one that is aligned with the researcher's work and has a good academic ranking.