High-Dimensional Vector Semantics

Feb-23-2018–arXiv.org Artificial Intelligence

In many natural language processing tasks, words and documents are represented using the "bag of words" model. In such a model, a document is represented by a high-dimensional vector whose components correspond to the frequencies of particular words in the document (for a detailed discussion see [1-3] and the references within). For example, assuming an English vocabulary of 25,000 words, each document is represented by a 25,000-dimensional vector, where component i is the frequency of the i-th word in the document. The vector representation is particularly useful in text classification tasks, where the similarity of two documents can be estimated simply as the dot product of their vectors. If the vectors are normalized to unit length, their dot product equals the cosine of the angle between them, so the more nearly parallel the vectors are, the more similar the documents are.
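As a rough illustration of the representation described above, the following minimal sketch builds word-frequency vectors over a tiny vocabulary and compares documents by the dot product of their normalized vectors. The five-word vocabulary, the sample sentences, and the function names (`bag_of_words`, `normalize`, `cosine_similarity`) are assumptions made for this example and do not come from the paper; whitespace tokenization is also a simplification.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return the frequency vector of `text` over `vocabulary`.

    Component i is the number of times vocabulary[i] occurs in the text.
    Words outside the vocabulary are ignored.
    """
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(u, v):
    """Dot product of the normalized vectors, i.e. the cosine of the angle between them."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


# Hypothetical toy vocabulary; a realistic one would have tens of thousands of entries.
vocabulary = ["cat", "dog", "sat", "mat", "ran"]

vec_a = bag_of_words("the cat sat on the mat", vocabulary)  # [1, 0, 1, 1, 0]
vec_b = bag_of_words("the cat sat", vocabulary)             # [1, 0, 1, 0, 0]
vec_c = bag_of_words("the dog ran", vocabulary)             # [0, 1, 0, 0, 1]

sim_ab = cosine_similarity(vec_a, vec_b)  # shares "cat" and "sat"
sim_ac = cosine_similarity(vec_a, vec_c)  # no shared vocabulary words
```

Here `sim_ab` is about 0.82 because the first two documents share two vocabulary words, while `sim_ac` is 0 because their vectors are orthogonal.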

artificial intelligence, machine learning, natural language
