Text Classification
Character-level Convolutional Networks for Text Classification
One of the common natural language understanding problems is text classification. Over last few decades, machine learning researchers have been moving from the simplest "bag of words" model to more sophisticated models for text classification. Bag of words model uses only information about which words are used in the text. Adding TFIDF to the bag of words helps to track relevancy of each word to the document. Bag of n-grams enables using partial information about structure of the text. Recurrent neural networks, like LSTM, can capture dependencies between words even if they are far from each other.
Improving Multi-Document Summarization via Text Classification
Cao, Ziqiang (The Hong Kong Polytechnic University) | Li, Wenjie (The Hong Kong Polytechnic University) | Li, Sujian (Peking University) | Wei, Furu (Microsoft Research)
Developed so far, multi-document summarization has reached its bottleneck due to the lack of sufficient training data and diverse categories of documents. Text classification just makes up for these deficiencies. In this paper, we propose a novel summarization system called TCSum, which leverages plentiful text classification data to improve the performance of multi-document summarization. TCSum projects documents onto distributed representations which act as a bridge between text classification and summarization. It also utilizes the classification results to produce summaries of different styles. Extensive experiments on DUC generic multi-document summarization datasets show that, TCSum can achieve the state-of-the-art performance without using any hand-crafted features and has the capability to catch the variations of summary styles with respect to different text categories.
Wikitop: Using Wikipedia Category Network to Generate Topic Trees
Kumar, Saravana (College of Engineering, Guindy) | Rengarajan, Prasath (College of Engineering, Guindy) | Annie, Arockia Xavier (College of Engineering, Guindy)
Automated topic identification is an essential component invarious information retrieval and knowledge representationtasks such as automated summary generation, categorization search and document indexing. In this paper, we present the Wikitop system to automatically generate topic trees from the input text by performing hierarchical classification using the Wikipedia Category Network (WCN). Our preliminary results over a collection of 125 articles are encouraging and show potential of a robust methodology for automated topic tree generation.
Cross-Domain Sentiment Classification via Topic-Related TrAdaBoost
Huang, Xingchang (Sun Yat-sen University) | Rao, Yanghui (Sun Yat-sen University) | Xie, Haoran (The Education University of Hong Kong) | Wong, Tak-Lam (The Education University of Hong Kong) | Wang, Fu Lee (Caritas Institute of Higher Education)
Cross-domain sentiment classification aims to tag sentiments for a target domain by labeled data from a source domain. Due to the difference between domains, the accuracy of a trained classifier may be very low. In this paper, we propose a boosting-based learning framework named TR-TrAdaBoost for cross-domain sentiment classification. We firstly explore the topic distribution of documents, and then combine it with the unigram TrAdaBoost. The topic distribution captures the domain information of documents, which is valuable for cross-domain sentiment classification. Experimental results indicate that TR-TrAdaBoost represents documents well and boost the performance and robustness of TrAdaBoost.
Active Discriminative Text Representation Learning
Zhang, Ye (University of Texas at Austin) | Lease, Matthew (University of Texas at Austin) | Wallace, Byron C. (Northeastern University)
We propose a new active learning (AL) method for text classification with convolutional neural networks (CNNs). In AL, one selects the instances to be manually labeled with the aim of maximizing model performance with minimal effort. Neural models capitalize on word embeddings as representations (features), tuning these to the task at hand. We argue that AL strategies for multi-layered neural models should focus on selecting instances that most affect the embedding space (i.e., induce discriminative word representations). This is in contrast to traditional AL approaches (e.g., entropy-based uncertainty sampling), which specify higher level objectives. We propose a simple approach for sentence classification that selects instances containing words whose embeddings are likely to be updated with the greatest magnitude, thereby rapidly learning discriminative, task-specific embeddings. We extend this approach to document classification by jointly considering: (1) the expected changes to the constituent word representations; and (2) the model’s current overall uncertainty regarding the instance. The relative emphasis placed on these criteria is governed by a stochastic process that favors selecting instances likely to improve representations at the outset of learning, and then shifts toward general uncertainty sampling as AL progresses. Empirical results show that our method outperforms baseline AL approaches on both sentence and document classification tasks. We also show that, as expected, the method quickly learns discriminative word embeddings. To the best of our knowledge, this is the first work on AL addressing neural models for text classification.
Semi-Supervised Multi-View Correlation Feature Learning with Application to Webpage Classification
Jing, Xiao-Yuan (Wuhan University; Nanjing University of Posts and Telecommunications) | Wu, Fei (Nanjing University of Posts and Telecommunications) | Dong, Xiwei (Nanjing University of Posts and Telecommunications) | Shan, Shiguang (Chinese Academy of Sciences (CAS)) | Chen, Songcan (Nanjing University of Aeronautics and Astronautics)
Webpage classification has attracted a lot of research interest. Webpage data is often multi-view and high-dimensional, and the webpage classification application is usually semi-supervised. Due to these characteristics, using semi-supervised multi-view feature learning (SMFL) technique to deal with the webpage classification problem has recently received much attention. However, there still exists room for improvement for this kind of feature learning technique. How to effectively utilize the correlation information among multi-view of webpage data is an important research topic. Correlation analysis on multi-view data can facilitate extraction of the complementary information. In this paper, we propose a novel SMFL approach, named semi-supervised multi-view correlation feature learning (SMCFL), for webpage classification. SMCFL seeks for a discriminant common space by learning a multi-view shared transformation in a semi-supervised manner. In the discriminant space, the correlation between intra-class samples is maximized, and the correlation between inter-class samples and the global correlation among both labeled and unlabeled samples are minimized simultaneously. We transform the matrix-variable based nonconvex objective function of SMCFL into a convex quadratic programming problem with one real variable, and can achieve a global optimal solution. Experiments on widely used datasets demonstrate the effectiveness and efficiency of the proposed approach.
Naive Bayesian Text Classification
Paul Graham popularized the term "Bayesian Classification" (or more accurately "Naïve Bayesian Classification") after his "A Plan for Spam" article was published (http://www.paulgraham.com/spam.html). In fact, text classifiers based on naïve Bayesian and other techniques have been around for many years. Companies such as Autonomy and Interwoven incorporate machine-learning techniques to automatically classify documents of all kinds; one such machine-learning technique is naïve Bayesian text classification. Naïve Bayesian text classifiers are fast, accurate, simple, and easy to implement. In this article, I present a complete naïve Bayesian text classifier written in 100 lines of commented, nonobfuscated Perl.
Supervised Word Mover's Distance
Huang, Gao, Guo, Chuan, Kusner, Matt J., Sun, Yu, Sha, Fei, Weinberger, Kilian Q.
Accurately measuring the similarity between text documents lies at the core of many real world applications of machine learning. These include web-search ranking, document recommendation, multi-lingual document matching, and article categorization. Recently, a new document metric, the word mover's distance (WMD), has been proposed with unprecedented results on kNN-based document classification. The WMD elevates high quality word embeddings to document metrics by formulating the distance between two documents as an optimal transport problem between the embedded words. However, the document distances are entirely unsupervised and lack a mechanism to incorporate supervision when available. In this paper we propose an efficient technique to learn a supervised metric, which we call the Supervised WMD (S-WMD) metric. Our algorithm learns document distances that measure the underlying semantic differences between documents by leveraging semantic differences between individual words discovered during supervised training. This is achieved with an linear transformation of the underlying word embedding space and tailored word-specific weights, learned to minimize the stochastic leave-one-out nearest neighbor classification error on a per-document level. We evaluate our metric on eight real-world text classification tasks on which S-WMD consistently outperforms almost all of our 26 competitive baselines.
TensorFlow -- Text Classification
On Nov 9, it's been an official 1 year since TensorFlow released. Looking back there has been a lot of progress done towards making TensorFlow the most used machine learning framework. And as this milestone passed, I realized that still haven't published long promised blog about text classification. Even though examples has been there in TensorFlow repository, they didn't have very good description. Text classification is one of the most important parts of machine learning, as most of people's communication is done via text.