Text Classification
Supervised Word Mover's Distance
Accurately measuring the similarity between text documents lies at the core of many real-world applications of machine learning. These include web-search ranking, document recommendation, multilingual document matching, and article categorization. Recently, a new document metric, the word mover's distance (WMD), has been proposed with unprecedented results on kNN-based document classification. The WMD elevates high-quality word embeddings to document metrics by formulating the distance between two documents as an optimal transport problem between the embedded words. However, the document distances are entirely unsupervised and lack a mechanism to incorporate supervision when it is available. In this paper we propose an efficient technique to learn a supervised metric, which we call the Supervised WMD (S-WMD) metric.
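To make the optimal transport formulation concrete, below is a minimal sketch of the unsupervised WMD computation (the supervised S-WMD additionally learns the underlying word distances, which is not shown here). The toy embeddings, uniform word weights, and the generic scipy linear-programming solver are illustrative choices, not the authors' implementation.

```python
# Minimal sketch: WMD as an optimal transport problem between embedded words.
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def wmd(emb_a, w_a, emb_b, w_b):
    """emb_*: (n, d) word embeddings; w_*: normalized word weights (sum to 1)."""
    C = cdist(emb_a, emb_b)              # pairwise Euclidean word distances
    n, m = C.shape
    # Equality constraints: transport plan rows sum to w_a, columns to w_b.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0  # row i of the (flattened) plan
    for j in range(m):
        A_eq[n + j, j::m] = 1.0           # column j of the (flattened) plan
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([w_a, w_b]),
                  bounds=(0, None), method="highs")
    return res.fun                        # minimal total transport cost

# Toy example: two 3-word documents in a 5-dimensional embedding space.
rng = np.random.default_rng(0)
d1, d2 = rng.normal(size=(3, 5)), rng.normal(size=(3, 5))
print(wmd(d1, np.ones(3) / 3, d2, np.ones(3) / 3))
```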
Aggressive Sampling for Multi-class to Binary Reduction with Applications to Text Classification
Bikash Joshi, Massih R. Amini, Ioannis Partalas, Franck Iutzeler, Yury Maximov
We address the problem of multi-class classification in the case where the number of classes is very large. We propose a double sampling strategy on top of a multi-class to binary reduction strategy, which transforms the original multi-class problem into a binary classification problem over pairs of examples. The aim of the sampling strategy is to overcome the curse of long-tailed class distributions exhibited in the majority of large-scale multi-class classification problems and to reduce the number of pairs of examples in the expanded data. We show that this strategy does not alter the consistency of the empirical risk minimization principle defined over the double sample reduction. Experiments are carried out on DMOZ and Wikipedia collections with 10,000 to 100,000 classes, where we show the efficiency of the proposed approach in terms of training and prediction time, memory consumption, and predictive performance with respect to state-of-the-art approaches.
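As an illustration of the reduction, here is a schematic sketch of turning multi-class data into labeled pairs with a two-stage sampling step. The function name, the uniform class/example sampling, and the same-class/different-class labels are simplifying assumptions for illustration; the paper's actual dyadic transformation and sampling scheme differ in detail.

```python
# Schematic sketch: multi-class data -> sampled binary pairs.
import random
from collections import defaultdict

def to_binary_pairs(X, y, n_classes_kept, pairs_per_example, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, c in zip(X, y):
        by_class[c].append(x)
    # Stage 1: sample a subset of classes, tempering the long tail of
    # classes with very few examples.
    kept = rng.sample(sorted(by_class), min(n_classes_kept, len(by_class)))
    pairs = []
    for c in kept:
        for x in by_class[c]:
            # Stage 2: draw only a few partner examples per example,
            # instead of expanding to all possible pairs.
            for _ in range(pairs_per_example):
                c2 = rng.choice(kept)
                x2 = rng.choice(by_class[c2])
                pairs.append(((x, x2), 1 if c2 == c else 0))
    return pairs
```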
Reviews: Diffusion Maps for Textual Network Embedding
The main idea of this paper is to use the diffusion convolutional operator to learn text embeddings that take into account the global influence of the whole graph. It then incorporates the diffusion process into the loss function to capture high-order proximity. In contrast, previous works either neglect the semantic distance indicated by the graph or fail to take into account the similarities of context influenced by global structural information. The authors then conduct experiments on the tasks of multi-label text classification and link prediction and show that the proposed model outperforms the baselines. Strength: The high-level idea of this paper is good, and the method is novel.
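To clarify what a diffusion operator of this kind does, the sketch below mixes node features through powers of the graph transition matrix, so each embedding reflects multi-hop (high-order) neighbors rather than only direct ones. The decay weight theta and truncation depth K are illustrative choices, not the paper's parameterization.

```python
# Minimal sketch of a truncated graph diffusion operator over node features.
import numpy as np

def diffusion_features(A, X, K=3, theta=0.5):
    """A: (n, n) adjacency matrix; X: (n, d) node/text features."""
    # Row-normalize the adjacency into a transition matrix.
    P = A / np.clip(A.sum(axis=1, keepdims=True), 1e-12, None)
    out, Pk = X.copy(), np.eye(A.shape[0])
    for k in range(1, K + 1):
        Pk = Pk @ P                      # k-step transition probabilities
        out = out + (theta ** k) * (Pk @ X)  # decayed k-hop contribution
    return out
```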
Table 5: Per-class accuracy scores on RVL-CDIP-N for each document classification model
There are several patterns: all models perform well on scientific_publication but typically poorly on handwritten and specification. We find that there is a higher degree of similarity in model predictions on RVL-CDIP-N than on RVL-CDIP-O. We find that the Augraphy augmentations typically have little impact. The main finding in Tables 14-17 is that out-of-domain detection suffers in the more realistic RVL-CDIP-N setting versus the RVL-CDIP-O setting, where both sets of test documents are out-of-distribution. This is in contrast with the T-O setting, where we use the in-distribution RVL-CDIP test set as the in-domain data.
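For reference, per-class accuracy scores of the kind reported in Table 5 can be computed as in the small sketch below; the label values are placeholders drawn from the label names mentioned above, not the actual evaluation code.

```python
# Small sketch: per-class accuracy from true and predicted labels.
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    correct, total = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}

# Toy usage with placeholder labels.
print(per_class_accuracy(
    ["handwritten", "handwritten", "specification"],
    ["handwritten", "specification", "specification"]))
```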