Text Classification using String Kernels
Lodhi, Huma, Shawe-Taylor, John, Cristianini, Nello, Watkins, Christopher J. C. H.
–Neural Information Processing Systems
HUlna Lodhi John Shawe-Taylor N ello Cristianini Chris Watkins Department of Computer Science Royal Holloway, University of London Egham, Surrey TW20 OEX, UK {huma, john, nello, chrisw}Cdcs.rhbnc.ac.uk Abstract We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence ofk characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation ofthis feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique.
Neural Information Processing Systems
Dec-31-2001