Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
Vinokourov, Alexei, Cristianini, Nello, Shawe-Taylor, John
Neural Information Processing Systems
The problem of learning a semantic representation of a text document from data is addressed, in the situation where a corpus of unlabeled paired documents is available, each pair being formed by a short English document and its French translation. This representation can then be used for any retrieval, categorization or clustering task, in both a standard and a cross-lingual setting. By using kernel functions, in this case simple bag-of-words inner products, each part of the corpus is mapped to a high-dimensional space. The correlations between the two spaces are then learnt by using kernel Canonical Correlation Analysis: a set of directions is found in the first space and in the second space that are maximally correlated. Since we assume the two representations are completely independent apart from their semantic content, any correlation between them should reflect some semantic similarity. Certain patterns of English words that relate to a specific meaning should correlate, across the corpus, with certain patterns of French words corresponding to the same meaning. Using the semantic representation obtained in this way, we first demonstrate that the correlations detected between the two versions of the corpus are significantly higher than random, and hence that a representation based on such features does capture statistical patterns that should reflect semantic information. We then use this representation in both cross-language and single-language retrieval tasks, observing performance that is consistently and significantly superior to LSI on the same data.
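The pipeline described above (bag-of-words kernels on each language, then kernel Canonical Correlation Analysis on the paired kernels) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy corpus, the helper names `bow_matrix`, `centre` and `kcca`, the regularisation parameter `kappa`, and the particular regularised formulation solved through an SVD are all choices made here for illustration.

```python
import numpy as np

def bow_matrix(docs):
    """Return a (documents x vocabulary) term-count matrix."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            X[i, index[w]] += 1.0
    return X

def centre(K):
    """Centre a kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, kappa=1.0, k=2):
    """Regularised kernel CCA (one common formulation, assumed here).

    Solves  Kx Ky b = rho (Kx + kappa I)^2 a  and
            Ky Kx a = rho (Ky + kappa I)^2 b
    by taking the SVD of (Kx + kappa I)^-1 Kx Ky (Ky + kappa I)^-1.
    Returns dual weights alpha, beta and the top-k values rho.
    """
    n = Kx.shape[0]
    Rx_inv = np.linalg.inv(Kx + kappa * np.eye(n))
    Ry_inv = np.linalg.inv(Ky + kappa * np.eye(n))
    M = Rx_inv @ Kx @ Ky @ Ry_inv
    U, s, Vt = np.linalg.svd(M)
    alpha = Rx_inv @ U[:, :k]
    beta = Ry_inv @ Vt.T[:, :k]
    return alpha, beta, s[:k]

# Hypothetical toy paired corpus: each English document and its French translation.
english = ["the cat sleeps", "the dog barks",
           "the cat eats fish", "the dog eats meat"]
french = ["le chat dort", "le chien aboie",
          "le chat mange du poisson", "le chien mange de la viande"]

# Bag-of-words inner products serve as the kernels on each side.
Xe, Xf = bow_matrix(english), bow_matrix(french)
Kx, Ky = centre(Xe @ Xe.T), centre(Xf @ Xf.T)

alpha, beta, rho = kcca(Kx, Ky, kappa=1.0, k=2)

# Language-independent coordinates of the training documents.
english_semantic = Kx @ alpha
french_semantic = Ky @ beta
print(rho)
```

A new English query would be mapped into the same space by computing its centred kernel values against the English training documents and multiplying by `alpha`; a French document is handled analogously with `beta`, which is what makes cross-language retrieval possible in this representation. The value of `kappa` controls how strongly trivial correlations are suppressed and would need tuning on real data.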
Dec-31-2003