Goto

Collaborating Authors

 Text Classification


Walmart Kaggle: Trip Type Classification

@machinelearnbot

They took the NYC Data Science Academy 12-week full-time data science bootcamp program from Sep. 23 to Dec. 18, 2015. The post was based on their fourth in-class project (due after the 8th week of the program). Walmart uses trip type classification to segment its shoppers and their store visits to better improve the shopping experience. Walmart's trip types are created from a combination of existing customer insights and purchase history data. The purpose of the Kaggle competition is to use only the purchase data provided to derive Walmart's classification labels.



Document Type Classification in Online Digital Libraries

AAAI Conferences

Online digital libraries make it easier for researchers to search for scientific information. They have been proven as powerful resources in many data mining, machine learning and information retrieval applications that require high-quality data. The quality of the data highly depends on the accuracy of classifiers that identify the types of documents that are crawled from the Web, e.g., as research papers, slides, books, etc., for appropriate indexing. These classifiers in turn depend on the choice of the feature representation. We propose novel features that result in high-accuracy classifiers for document type classification. Experimental results on several datasets show that our classifiers outperform models that are employed in current systems.


Distributional Correspondence Indexing for Cross-Lingual and Cross-Domain Sentiment Classification.

Journal of Artificial Intelligence Research

Domain Adaptation (DA) techniques aim at enabling machine learning methods learn effective classifiers for a "target'' domain when the only available training data belongs to a different "source'' domain. In this paper we present the Distributional Correspondence Indexing (DCI) method for domain adaptation in sentiment classification. DCI derives term representations in a vector space common to both domains where each dimension reflects its distributional correspondence to a pivot, i.e., to a highly predictive term that behaves similarly across domains. Term correspondence is quantified by means of a distributional correspondence function (DCF). We propose a number of efficient DCFs that are motivated by the distributional hypothesis, i.e., the hypothesis according to which terms with similar meaning tend to have similar distributions in text. Experiments show that DCI obtains better performance than current state-of-the-art techniques for cross-lingual and cross-domain sentiment classification. DCI also brings about a significantly reduced computational cost, and requires a smaller amount of human intervention. As a final contribution, we discuss a more challenging formulation of the domain adaptation problem, in which both the cross-domain and cross-lingual dimensions are tackled simultaneously.


Character-level Convolutional Networks for Text Classification

Neural Information Processing Systems

This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.


A Subspace Learning Framework for Cross-Lingual Sentiment Classification with Partial Parallel Data

AAAI Conferences

Cross-lingual sentiment classification aims to automatically predict sentiment polarity (e.g., positive or negative) of data in a label-scarce target language by exploiting labeled data from a label-rich language. The fundamental challenge of cross-lingual learning stems from a lack of overlap between the feature spaces of the source language data and that of the target language data. To address this challenge, previous work in the literature mainly relies on the large amount of bilingual parallel corpora to bridge the language gap. In many real applications, however, it is often the case that we have some partial parallel data but it is an expensive and time-consuming job to acquire large amount of parallel data on different languages. In this paper, we propose a novel subspace learning framework by leveraging the partial parallel data for cross-lingual sentiment classification. The proposed approach is achieved by jointly learning the document-aligned review data and un-aligned data from the source language and the target language via a non-negative matrix factorization framework. We conduct a set of experiments with cross-lingual sentiment classification tasks on multilingual Amazon product reviews. Our experimental results demonstrate the efficacy of the proposed cross-lingual approach.


Feature Ensemble Plus Sample Selection: Domain Adaptation for Sentiment Classification (Extended Abstract)

AAAI Conferences

The domain adaptation problem arises often in the field of sentiment classification. There are two distinct needs in domain adaptation, namely labeling adaptation and instance adaptation. Most of current research focuses on the former one, while neglects the latter one. In this work, we propose a joint approach, named feature ensemble plus sample selection (SS-FE), which takes both types of adaptation into account. A feature ensemble (FE) model is first proposed to learn a new labeling function in a feature re-weighting manner. Furthermore, a PCA-based sample selection (PCA-SS) method is proposed as an aid to FE for instance adaptation. Experimental results show that the proposed SS-FE approach could gain significant improvements, compared to individual FE and PCA-SS, due to its comprehensive consideration of both labeling adaptation and instance adaptation.


Recurrent Convolutional Neural Networks for Text Classification

AAAI Conferences

Text classification is a foundational task in many NLP applications. Traditional text classifiers often rely on many human-designed features, such as dictionaries, knowledge bases and special tree kernels. In contrast to traditional methods, we introduce a recurrent convolutional neural network for text classification without human-designed features. In our model, we apply a recurrent structure to capture contextual information as far as possible when learning word representations, which may introduce considerably less noise compared to traditional window-based neural networks. We also employ a max-pooling layer that automatically judges which words play key roles in text classification to capture the key components in texts. We conduct experiments on four commonly used datasets. The experimental results show that the proposed method outperforms the state-of-the-art methods on several datasets, particularly on document-level datasets.


Dataless Text Classification with Descriptive LDA

AAAI Conferences

Manually labeling documents for training a text classifier is expensive and time-consuming. Moreover, a classifier trained on labeled documents may suffer from overfitting and adaptability problems. Dataless text classification (DLTC) has been proposed as a solution to these problems, since it does not require labeled documents. Previous research in DLTC has used explicit semantic analysis of Wikipedia content to measure semantic distance between documents, which is in turn used to classify test documents based on nearest neighbours. The semantic-based DLTC method has a major drawback in that it relies on a large-scale, finely-compiled semantic knowledge base, which is difficult to obtain in many scenarios. In this paper we propose a novel kind of model, descriptive LDA (DescLDA), which performs DLTC with only category description words and unlabeled documents. In DescLDA, the LDA model is assembled with a describing device to infer Dirichlet priors from prior descriptive documents created with category description words. The Dirichlet priors are then used by LDA to induce category-aware latent topics from unlabeled documents. Experimental results with the 20Newsgroups and RCV1 datasets show that: (1) our DLTC method is more effective than the semantic-based DLTC baseline method; and (2) the accuracy of our DLTC method is very close to state-of-the-art supervised text classification methods. As neither external knowledge resources nor labeled documents are required, our DLTC method is applicable to a wider range of scenarios.


Exploring Social Context for Topic Identification in Short and Noisy Texts

AAAI Conferences

With the pervasion of social media, topic identification in short texts attracts increasing attention in  recent years. However, in nature the texts of social media are short and noisy, and the structures are sparse and dynamic, resulting in difficulty to identify topic categories exactly from online social media. Inspired by social science findings that preference consistency and social contagion are observed in social media, we investigate topic identification in short and noisy texts by exploring social context from the perspective of social sciences. In particular, we present a mathematical optimization formulation that incorporates the preference consistency and social contagion theories into a supervised learning method, and conduct feature selection to tackle short and noisy texts in social media, which result in a Sociological framework for Topic Identification (STI). Experimental results on real-world datasets from Twitter and Citation Network demonstrate the effectiveness of the proposed framework. Further experiments are conducted to understand the importance of social context in topic identification.