Text Classification
Semisupervised Text Classification Using Unsupervised Topic Information
Dorado, Rubén (École de Technologie Supérieure, Université du Québec) | Ratté, Sylvie
Labeling corpora is a time-consuming and recurring problem when developing practical NLP applications. In this paper, we present a semi-supervised method to build a text classifier using unsupervised topic information. The objective is to use the least amount of labeled data to accelerate the creation of corpora for classification in specific domains. We show that it is possible to obtain performance similar to state-of-the-art methods, despite the limited quantity of data.
Text Analysis 101; A Basic Understanding for Business Users: Clustering and Unsupervised Methods
This blog was originally posted as part of our Text Analysis 101 blog series. It aims to explain how the classification of text works as part of Natural Language Processing. It was the second blog on harnessing Machine Learning (ML) in the form of Natural Language Processing (NLP) for the automatic classification of documents. By classifying text, we aim to assign a document or piece of text to one or more classes or categories, making it easier to manage or sort. A document classifier often returns or assigns a category "label" or "code" to a document or piece of text.
A judge has partially dismissed Twitter's surveillance case against the government
A California court has dismissed part of a lawsuit brought by Twitter that challenges U.S. government restrictions on what it can say about surveillance requests on its users. Twitter sued the government in 2014, alleging that the restrictions, which are common to all Internet service providers, infringe its First Amendment right to free speech. Earlier this year, the Department of Justice asked the federal district court in Oakland, California, to toss out the lawsuit. It argued that the Foreign Intelligence Surveillance Court (FISC) is a more suitable venue to hear the dispute, and that part of Twitter's argument didn't stand because the company isn't disputing document classification decisions made by the government. On Monday, a judge agreed with the government's latter argument but denied its request to shift the case to FISC.
Automatic Webpage Classification • /r/MachineLearning
I'm trying to create a document classifier but I'm not able to think of features to use. Does anybody have experience with this? I used Beautiful Soup to remove the tags. I know tf-idf can be used, but I'm not exactly sure how. Suggestions on how to 'clean' the data better (e.g. removing stop words, stemming, etc.) are also welcome.
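As a rough illustration of the tf-idf and stop-word steps mentioned in the post, here is a minimal sketch in plain Python. The stop-word list, the `tokenize` and `tfidf` helpers, and the sample pages are all hypothetical and chosen for illustration; the weighting shown (term frequency times log(N / df)) is one common variant, and libraries may use a slightly different formula. Stemming is left out.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "for"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


# Hypothetical page texts, already stripped of HTML tags.
pages = [
    "The price of the stock rose in early trading",
    "The team won the match in extra time",
    "Stock markets and the price of oil",
]
vectors = tfidf(pages)
```

Each resulting dict can serve as a sparse feature vector for a classifier. Terms that appear in only one page (such as "rose") receive a higher weight than terms shared across pages (such as "price").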
Semi-Supervised Multinomial Naive Bayes for Text Classification by Leveraging Word-Level Statistical Constraint
Zhao, Li (Tsinghua University) | Huang, Minlie (Tsinghua University) | Yao, Ziyu (Beijing University of Posts and Telecommunications) | Su, Rongwei (Samsung Research and Development Institute China - Beijing) | Jiang, Yingying (Samsung Research and Development Institute China - Beijing) | Zhu, Xiaoyan (Tsinghua University)
Multinomial Naive Bayes with Expectation Maximization (MNB-EM) is a standard semi-supervised learning method to augment Multinomial Naive Bayes (MNB) for text classification. Despite its success, MNB-EM is not stable, and may succeed or fail to improve MNB. We believe that this is because MNB-EM lacks the ability to preserve the class distribution on words. In this paper, we propose a novel method to augment MNB-EM by leveraging word-level statistical constraints to preserve the class distribution on words. The word-level statistical constraints are further converted to constraints on document posteriors generated by MNB-EM. Experiments demonstrate that our method consistently improves MNB-EM and remarkably outperforms state-of-the-art baselines.
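For readers unfamiliar with the baseline, the following is a minimal sketch of plain MNB-EM in stdlib Python. It does not implement the word-level constraints proposed in the paper. The function names, the Laplace smoothing parameter `alpha`, the fixed iteration count, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_mnb(docs, weights, classes, vocab, alpha=1.0):
    """Fit MNB from soft labels: weights[i][c] is P(class c | doc i)."""
    class_mass = {c: alpha for c in classes}  # smoothed (soft) class counts
    word_counts = {c: Counter() for c in classes}
    for tokens, w in zip(docs, weights):
        for c in classes:
            class_mass[c] += w[c]
            for t in tokens:
                word_counts[c][t] += w[c]
    total = sum(class_mass.values())
    log_prior = {c: math.log(class_mass[c] / total) for c in classes}
    log_lik = {}
    for c in classes:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((word_counts[c][t] + alpha) / denom) for t in vocab}
    return log_prior, log_lik


def posterior(tokens, model, classes):
    """Return P(class | tokens); out-of-vocabulary tokens are ignored."""
    log_prior, log_lik = model
    scores = {
        c: log_prior[c] + sum(log_lik[c][t] for t in tokens if t in log_lik[c])
        for c in classes
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def mnb_em(labeled, unlabeled, classes, n_iter=10):
    """labeled: list of (tokens, class); unlabeled: list of token lists."""
    vocab = {t for tokens, _ in labeled for t in tokens}
    vocab |= {t for tokens in unlabeled for t in tokens}
    lab_docs = [tokens for tokens, _ in labeled]
    lab_w = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    # Initialize from the labeled documents only.
    model = train_mnb(lab_docs, lab_w, classes, vocab)
    for _ in range(n_iter):
        # E-step: soft-label the unlabeled documents with the current model.
        unl_w = [posterior(tokens, model, classes) for tokens in unlabeled]
        # M-step: refit on labeled (hard) plus unlabeled (soft) documents.
        model = train_mnb(lab_docs + unlabeled, lab_w + unl_w, classes, vocab)
    return model


# Hypothetical toy data.
classes = ["spam", "ham"]
labeled = [
    ("cheap pills buy now".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
]
unlabeled = [
    "buy cheap watches now".split(),
    "monday meeting moved to tuesday".split(),
    "watches discount".split(),
]
model = mnb_em(labeled, unlabeled, classes)
pred = posterior("discount watches".split(), model, classes)
```

In this toy run, neither "discount" nor "watches" occurs in the labeled data, so any preference for one class on that query comes from the unlabeled documents. The same mechanism can also propagate mistakes, which is one way to read the instability the abstract describes.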
Text Classification with Heterogeneous Information Network Kernels
Wang, Chenguang (Peking University) | Song, Yangqiu (West Virginia University) | Li, Haoran (Peking University) | Zhang, Ming (Peking University) | Han, Jiawei (University of Illinois at Urbana-Champaign)
Text classification is an important problem with many applications. Traditional approaches represent text as a bag-of-words and build classifiers based on this representation. Compared with words, entity phrases, the relations between the entities, and the types of the entities and relations carry much more information for representing texts. This paper presents a novel text-as-network classification framework, which introduces 1) a structured and typed heterogeneous information network (HIN) representation of texts, and 2) a meta-path based approach to link texts. We show that with the new representation and links of texts, the structured and typed information of entities and relations can be incorporated into kernels. In particular, we develop both a simple linear kernel and an indefinite kernel based on meta-paths in the HIN representation of texts, which we call HIN-kernels. Using Freebase, a well-known world knowledge base, to construct HINs for texts, our experiments on two benchmark datasets show that the indefinite HIN-kernel based on weighted meta-paths outperforms the state-of-the-art methods and other HIN-kernels.
Robust Text Classification in the Presence of Confounding Bias
Landeiro, Virgile (Illinois Institute of Technology) | Culotta, Aron (Illinois Institute of Technology)
As text classifiers become increasingly used in real-time applications, it is critical to consider not only their accuracy but also their robustness to changes in the data distribution. In this paper, we consider the case where there is a confounding variable Z that influences both the text features X and the class variable Y. For example, a classifier trained to predict the health status of a user based on their online communications may be confounded by socioeconomic variables. When the influence of Z changes from training to testing data, we find that classifier accuracy can degrade rapidly. Our approach, based on Pearl's back-door adjustment, estimates the underlying effect of a text variable on the class variable while controlling for the confounding variable. Although our goal is prediction, not causal inference, we find that such adjustments are essential to building text classifiers that are robust to confounding variables. On three diverse text classification tasks, we find that covariate adjustment results in higher accuracy than competing baselines over a range of confounding relationships (e.g., in one setting, accuracy improves from 60% to 81%).
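The back-door adjustment itself is the standard formula P(y | do(x)) = Σ_z P(y | x, z) P(z), which averages over the marginal distribution of the confounder rather than its distribution given x. The sketch below illustrates only that formula on made-up probability tables. It is not the paper's classifier, and the function names and all numbers are hypothetical.

```python
def backdoor_adjust(p_y_given_xz, p_z, x):
    """P(y=1 | do(x)) = sum over z of P(y=1 | x, z) * P(z)."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())


def naive_conditional(p_y_given_xz, p_z_given_x, x):
    """P(y=1 | x) = sum over z of P(y=1 | x, z) * P(z | x)."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z_given_x[x].items())


# Hypothetical tables: binary text feature x, binary confounder z.
p_z = {0: 0.5, 1: 0.5}                       # marginal P(z)
p_z_given_x = {1: {0: 0.1, 1: 0.9},          # in training, x=1 mostly co-occurs with z=1
               0: {0: 0.9, 1: 0.1}}
p_y_given_xz = {(1, 0): 0.2, (1, 1): 0.8,    # P(y=1 | x, z)
                (0, 0): 0.1, (0, 1): 0.7}

adjusted = backdoor_adjust(p_y_given_xz, p_z, x=1)
unadjusted = naive_conditional(p_y_given_xz, p_z_given_x, x=1)
```

With these made-up numbers, the unadjusted estimate for x=1 is noticeably higher than the adjusted one, because most of the apparent effect of x is carried by its co-occurrence with z=1 in the training data.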
Natural Language Processing for programmers: part 3 -- World Writable
Previously, I experimented with text generation using context-free grammars, one of the oldest techniques in natural language processing. I'll come back to CFGs in a future post. In this one I'm going to try my hand at classifiers. Automatic classification is the process by which a computer is trained to categorize an item into one or more defined buckets. A common type of classification is no doubt working on your behalf right this moment: spam filtering.
5 Text Classification Case Studies Using SciKit Learn
News Classification for Startup Intelligence: CB Insights, a startup intelligence data provider, shows an example of classifying news into HR and employee-related categories. Its assessment of private company health includes tracking companies' human resources activities, including programmatic monitoring of hiring activity as evidenced by job postings, key hires, and departures. They used scikit-learn to help with this work. Human resources classification is a binary classification problem in the sense that the classifier should discriminate news about companies' human resources events from all other news.
A Short Introduction to Using Word2Vec for Text Classification
Machine learning applications on natural language are an extremely important tool in the data scientist's toolbox. Use cases can include auto-detecting the language of a website, detecting spam in your spam filter, or auto-completing search queries. When you're working with text data, an important use case is text classification, where the data scientist is tasked with creating an algorithm that can figure out what a bit of text is all about (what the tagline is) based on what is written in the document. This can be used in a myriad of examples we see every day, tagging things such as blog articles, app descriptions, and reviews. In many cases traditional text classification can be difficult to scale, because as the number of categories in the taxonomy increases, the amount of training required increases as well.