Although there has been significant previous work on semi-supervised learning for classification, there has been relatively little in sequence modeling. This paper presents an approach that leverages recent work in manifold-learning on sequences to discover word clusters from language data, including both syntactic classes and semantic topics. From unlabeled data we form a smooth, low-dimensional feature space, where each word token is projected based on its underlying role as a function or content word. We then use this projection as additional input features to a linear-chain conditional random field trained on limited labeled training data. On standard part-of-speech tagging and Chinese word segmentation data sets we show as much as 14% error reduction due to the unlabeled data, and also statistically-significant improvements over a related semi-supervised sequence tagging method due to Miller et al.
Jan-11-2006, 06:08:48 GMT