High-Dimensional Vector Semantics
arXiv.org Artificial Intelligence
In many natural language processing tasks, words and documents are represented using the "bag of words" model. In this model, a document is represented by a high-dimensional vector whose components correspond to the frequencies of particular words in the document (for a detailed discussion see [1-3] and the references therein). For example, assuming an English vocabulary of 25,000 words, each document is represented by a 25,000-dimensional vector in which component i is the frequency of the i-th word in the document. The vector representation is particularly useful in text classification tasks, where the similarity of two documents can be estimated simply by the dot product of their vectors. If the vectors are normalized, their dot product equals the cosine of the angle between them, so the more nearly parallel the vectors are, the more similar the documents are.
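The bag-of-words representation and the normalized dot product described above can be illustrated with a minimal sketch. The five-word vocabulary, the whitespace tokenizer, and the helper names (`bag_of_words`, `normalize`, `cosine_similarity`) are assumptions made for this example only and do not come from the paper; a realistic vocabulary would have tens of thousands of entries.

```python
import math
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return the word-frequency vector of `text` over `vocabulary`.

    Component i is the number of times the i-th vocabulary word occurs.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def normalize(vector):
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]

def cosine_similarity(u, v):
    """Dot product of the normalized vectors, i.e. the cosine of their angle."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy vocabulary; words outside it are simply ignored.
vocabulary = ["cat", "dog", "sat", "mat", "ran"]

doc_a = bag_of_words("the cat sat on the mat", vocabulary)
doc_b = bag_of_words("the cat sat", vocabulary)
doc_c = bag_of_words("the dog ran", vocabulary)

print(cosine_similarity(doc_a, doc_b))  # shares "cat" and "sat" with doc_a
print(cosine_similarity(doc_a, doc_c))  # shares no vocabulary words with doc_a
```

In this toy setting, documents that share vocabulary words score closer to 1, while documents with no words in common have orthogonal vectors and score 0.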
Feb-23-2018