High-Dimensional Vector Semantics

Andrecut, M.

arXiv.org Artificial Intelligence 

In many natural language processing tasks, words and documents are represented using the "bag of words" model. In such a model, a document is represented by a high-dimensional vector whose components correspond to the frequencies of particular words in the document (for a detailed discussion see [1-3] and the references within). For example, assuming an English vocabulary of 25,000 words, each document is represented by a 25,000-dimensional vector, where component i is the frequency of the i-th word in the document. The vector representation is particularly useful in text classification tasks, where the similarity of two documents can be estimated simply as the dot product of their vectors. If the vectors are normalized to unit length, their dot product equals the cosine of the angle between them, and therefore the more parallel the vectors are, the more similar the documents are.
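
The bag-of-words similarity described above can be illustrated with a minimal sketch. The five-word vocabulary, the whitespace tokenizer, the sample documents, and the function names are all assumptions made for illustration; they are not taken from the paper, which considers vocabularies on the order of 25,000 words.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return the word-frequency vector of `text` over a fixed vocabulary."""
    # Naive tokenization (lowercase, split on whitespace); an assumption for this sketch.
    counts = Counter(text.lower().split())
    # Component i is the frequency of the i-th vocabulary word; other words are ignored.
    return [counts[word] for word in vocabulary]


def normalize(vector):
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(u, v):
    """Dot product of the normalized vectors, i.e. the cosine of the angle between them."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


# Hypothetical toy vocabulary and documents.
vocabulary = ["cat", "dog", "sat", "mat", "log"]
vec_a = bag_of_words("the cat sat on the mat", vocabulary)  # [1, 0, 1, 1, 0]
vec_b = bag_of_words("the dog sat on the log", vocabulary)  # [0, 1, 1, 0, 1]
vec_c = bag_of_words("cat cat mat", vocabulary)             # [2, 0, 0, 1, 0]

sim_ab = cosine_similarity(vec_a, vec_b)  # share only "sat"
sim_ac = cosine_similarity(vec_a, vec_c)  # share "cat" and "mat"
```

In this toy example the first and third documents share more vocabulary words than the first and second, so their vectors are closer to parallel and the cosine is larger.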
