Deep Learning Research Review: Natural Language Processing

@machinelearnbot 

If we had a million words (not really a lot in NLP standards), we'd have a million by million sized matrix which would be extremely sparse (lots of 0's). The basic idea behind word vector initialization techniques is that we want to store as much information as we can in this word vector while still keeping the dimensionality at a manageable scale (25 – 1000 dimensions is ideal). Formally, our function seeks to maximize the log probability of any context word given the current center word. One Sentence Summary: Word2Vec seeks to find vector representations of different words by maximizing the log probability of context words given a center word and modifying the vectors through SGD.