skip-gram
Learning Word Embedding with Better Distance Weighting and Window Size Scheduling
As a highly successful word embedding model, Word2Vec offers an efficient method for learning distributed word representations on large datasets. However, Word2Vec lacks consideration for distances between center and context words. We propose two novel methods, Learnable Formulated Weights (LFW) and Epoch-based Dynamic Window Size (EDWS), to incorporate distance information into two variants of Word2Vec, the Continuous Bag-of-Words (CBOW) model and the Continuous Skip-gram (Skip-gram) model. For CBOW, LFW uses a formula with learnable parameters that best reflects the relationship of influence and distance between words to calculate distance-related weights for average pooling, providing insights for future NLP text modeling research. For Skip-gram, we improve its dynamic window size strategy to introduce distance information in a more balanced way. Experiments prove the effectiveness of LFW and EDWS in enhancing Word2Vec's performance, surpassing previous state-of-the-art methods.
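For context on the "dynamic window size strategy" mentioned above, here is a minimal sketch of the baseline behaviour. It assumes the reference word2vec scheme of drawing the effective window size uniformly from 1 to the maximum window for each center word. It does not reproduce the LFW or EDWS methods, whose formulas are not given in this abstract. The function name `inclusion_probability` is hypothetical.

```python
from fractions import Fraction

def inclusion_probability(distance, max_window):
    """Probability that a context word `distance` positions away is used,
    assuming the effective window is drawn uniformly from 1..max_window."""
    if distance < 1 or distance > max_window:
        return Fraction(0)
    # The word is included whenever the sampled window is >= distance,
    # which happens for (max_window - distance + 1) of the max_window draws.
    return Fraction(max_window - distance + 1, max_window)

for d in range(1, 6):
    print(d, inclusion_probability(d, max_window=5))
```

Under this assumption nearer words are sampled more often than distant ones, so distance already acts as an implicit linear weight in the baseline.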
Contrastive Loss is All You Need to Recover Analogies as Parallel Lines
Ri, Narutatsu, Lee, Fei-Tzin, Verma, Nakul
While static word embedding models are known to represent linguistic analogies as parallel lines in high-dimensional space, the underlying mechanism as to why they result in such geometric structures remains obscure. We find that an elementary contrastive-style method employed over distributional information performs competitively with popular word embedding models on analogy recovery tasks, while achieving dramatic speedups in training time. Further, we demonstrate that a contrastive loss is sufficient to create these parallel structures in word embeddings, and establish a precise relationship between the co-occurrence statistics and the geometric structure of the resulting word embeddings.
Robust Dynamic Network Embedding via Ensembles
Hou, Chengbin, Fu, Guoji, Yang, Peng, He, Shan, Tang, Ke
Dynamic Network Embedding (DNE) has recently attracted considerable attention due to the advantage of network embedding in various applications and the dynamic nature of many real-world networks. For dynamic networks, the degree of changes, defined as the average number of changed edges between consecutive snapshots spanning a dynamic network, can be very different in real-world scenarios. Although quite a few DNE methods have been proposed, it remains unclear whether and to what extent the existing DNE methods are robust to the degree of changes, which is nevertheless an important factor in both academic research and industrial applications. In this work, we investigate the robustness issue of DNE methods w.r.t. the degree of changes for the first time and accordingly propose a robust DNE method. Specifically, the proposed method follows the notion of ensembles where the base learner adopts an incremental Skip-Gram neural embedding approach. To further boost the performance, a novel strategy is proposed to enhance the diversity among base learners at each timestep by capturing different levels of local-global topology. Extensive experiments demonstrate the benefits of special designs in the proposed method, and the superior performance of the proposed method compared to state-of-the-art methods. The comparative study also reveals the robustness issue of some DNE methods. The source code is available at https://github.com/houchengbin/SG-EDNE
Word Embeddings in High-Level
The most common representation of words in NLP tasks is the One Hot Encoding. Below we can see an example of One Hot Encoding for the words "Cat" and "Dog". As we can see, these two vectors are orthogonal (independent) since their inner product is 0, and their Euclidean distance is \(\sqrt{2}\). Notice that this applies to every pair in the vocabulary, meaning that every pair of words is independent and every pair is at distance \(\sqrt{2}\). For example, the words below are considered independent, and the distance (and hence the similarity) between any pair of words is the same. This is an issue for NLP tasks since we want to be able to capture the relation between words.
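The claim about inner products and distances can be checked with a short sketch. The four-word vocabulary and the helper names (`one_hot`, `dot`, `euclidean`) are assumptions made for illustration only.

```python
import math

def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy vocabulary; any set of distinct words behaves the same way.
vocab = ["cat", "dog", "car", "tree"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Every distinct pair has inner product 0 and distance sqrt(2).
for i, a in enumerate(vocab):
    for b in vocab[i + 1:]:
        print(a, b, dot(vectors[a], vectors[b]), euclidean(vectors[a], vectors[b]))
```

Because "cat" is exactly as far from "dog" as it is from "car", the encoding carries no information about which words are related.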
Using Word2Vec for Better Embeddings of Categorical Features
Back in 2012, when neural networks regained popularity, people were excited about the possibility of training models without having to worry about feature engineering. Indeed, most of the earliest breakthroughs were in computer vision, in which raw pixels were used as input for networks. Soon enough it turned out that if you wanted to use textual data, clickstream data, or pretty much any data with categorical features, at some point you'd have to ask yourself -- how do I represent my categorical features as vectors that my network can work with? The most popular approach is embedding layers -- you add an extra layer to your network, which assigns a vector to each value of the categorical feature. During training the network learns the weights for the different layers, including those embeddings.
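As a rough illustration of what an embedding layer stores, the sketch below builds a lookup table that assigns one vector to each value of a categorical feature. The class name `EmbeddingTable`, the browser categories and the initialization range are all assumptions. In a real network these rows would be updated by backpropagation along with the other layers, which is not shown here.

```python
import random

class EmbeddingTable:
    """Maps each value of a categorical feature to a vector.

    The vectors are only randomly initialized here; training them is
    outside the scope of this sketch.
    """

    def __init__(self, categories, dim, seed=0):
        rng = random.Random(seed)
        # One row index per distinct category value.
        self.index = {c: i for i, c in enumerate(categories)}
        # One `dim`-dimensional row of weights per category.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in categories
        ]

    def lookup(self, category):
        """Return the vector currently assigned to `category`."""
        return self.weights[self.index[category]]

# Hypothetical categorical feature with three values.
table = EmbeddingTable(["chrome", "firefox", "safari"], dim=4)
vec = table.lookup("firefox")
print(vec)
```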
Improving Skip-Gram based Graph Embeddings via Centrality-Weighted Sampling
Almagro-Blanco, Pedro, Sancho-Caparrini, Fernando
Network embedding techniques inspired by word2vec represent an effective unsupervised relational learning model. Commonly, by means of a Skip-Gram procedure, these techniques learn low-dimensional vector representations of the nodes in a graph by sampling node-context examples. Although many ways of sampling the context of a node have been proposed, the effects of the way a node is chosen have not been analyzed in depth. To fill this gap, we have re-implemented the four main word2vec-inspired graph embedding techniques under the same framework and analyzed how different sampling distributions affect embedding performance when tested on node classification problems. We present a set of experiments on different well-known real data sets that show how the use of popular centrality distributions in sampling leads to improvements, obtaining speedups of up to 2 times in learning time and increasing accuracy in all cases.
Word2vec Made Easy
This post is a simplified yet in-depth guide to word2vec. In this article, we will implement a word2vec model from scratch and see how embeddings help to find similar/dissimilar words. Word2Vec is a foundational technique in NLP (Natural Language Processing). Tomas Mikolov and a team of researchers developed the technique in 2013 at Google. Their approach was first published in the paper 'Efficient Estimation of Word Representations in Vector Space'.
Concept2vec: Metrics for Evaluating Quality of Embeddings for Ontological Concepts
Alshargi, Faisal, Shekarpour, Saeedeh, Soru, Tommaso, Sheth, Amit
Although there is an emerging trend towards generating embeddings for primarily unstructured data, and recently for structured data, there is not yet any systematic suite for measuring the quality of embeddings. This deficiency is further sensed with respect to embeddings generated for structured data because there are no concrete evaluation metrics measuring the quality of encoded structure as well as semantic patterns in the embedding space. In this paper, we introduce a framework containing three distinct tasks concerned with the individual aspects of ontological concepts: (i) the categorization aspect, (ii) the hierarchical aspect, and (iii) the relational aspect. Then, in the scope of each task, a number of intrinsic metrics are proposed for evaluating the quality of the embeddings. Furthermore, w.r.t. this framework multiple experimental studies were run to compare the quality of the available embedding models. Employing this framework in future research can reduce misjudgment and provide greater insight about quality comparisons of embeddings for ontological concepts.
A non-NLP application of Word2Vec – Towards Data Science – Medium
The above is exactly what Word2Vec seeks to do: it tries to determine the meaning of a word by analyzing its neighboring words (also called context). The algorithm exists in two flavors: CBOW and Skip-Gram. Given a set of sentences (also called a corpus), the model loops over the words of each sentence and either tries to use the current word to predict its neighbors (its context), in which case the method is called "Skip-Gram", or it uses each of these contexts to predict the current word, in which case the method is called "Continuous Bag Of Words" (CBOW). The limit on the number of words in each context is determined by a parameter called "window size". So if we choose, for example, the Skip-Gram method, Word2Vec then consists of using a shallow neural network, i.e. a neural network with only one hidden layer, to learn the word embedding. The network first initializes its weights randomly, then iteratively adapts them during training to minimize the error it makes when using words to predict their contexts.
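The difference between the two flavors is easiest to see in the training examples each one builds from a sentence. The sketch below covers only that data-preparation step, not the neural network itself. The function names and the example sentence are assumptions, and `window_size` here counts words on each side of the current word.

```python
def skipgram_pairs(tokens, window_size):
    """(center, context) pairs: the current word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window_size):
    """(context_words, center) examples: the neighbors predict the current word."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:  # skip one-word sentences, which have no context
            examples.append((context, center))
    return examples

# Hypothetical toy sentence.
sentence = "the quick brown fox".split()
print(skipgram_pairs(sentence, window_size=1))
print(cbow_examples(sentence, window_size=1))
```

With a window size of 1, Skip-Gram emits one example per (word, neighbor) pair, while CBOW emits one example per word with all of its neighbors grouped together.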
word2vec Skip-Gram with Negative Sampling is a Weighted Logistic PCA
Landgraf, Andrew J., Bellay, Jeremy
Mikolov et al. (2013) introduced the skip-gram formulation for neural word embeddings, wherein one tries to predict the context of a given word. Their negative-sampling algorithm improved the computational feasibility of training the embeddings. Due to the embeddings' state-of-the-art performance on a number of tasks, there has been much research aimed at better understanding them. Goldberg and Levy (2014) showed that the skip-gram with negative-sampling algorithm (SGNS) maximizes a different likelihood than the skip-gram formulation poses and further showed how it is implicitly related to pointwise mutual information (Levy and Goldberg, 2014). We show that SGNS is a weighted logistic PCA, which is a special case of exponential family PCA for the binomial likelihood. Cotterell et al. (2017) showed that the skip-gram formulation can be viewed as exponential family PCA with a multinomial likelihood, but they did not make the connection between the negative-sampling algorithm and the binomial likelihood. Li et al. (2015) showed that SGNS is an explicit matrix factorization related to representation learning, but the matrix factorization objective they found was complicated and they did not find the connection to the binomial distribution or exponential family PCA.
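To make the link to a binary (logistic) likelihood more concrete, here is a minimal sketch of the standard per-example SGNS loss: one observed (word, context) pair treated as a positive label and several sampled contexts treated as negative labels. It does not reproduce the weighted logistic PCA derivation of this paper. The function names and the plain-list vector representation are assumptions.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(center_vec, context_vec, negative_vecs):
    """Negative log-likelihood for one observed pair plus its negative samples.

    The observed context is a positive Bernoulli outcome with logit
    center . context; each sampled context is a negative outcome.
    """
    loss = -log_sigmoid(dot(center_vec, context_vec))
    for neg in negative_vecs:
        loss -= log_sigmoid(-dot(center_vec, neg))
    return loss

# Hypothetical 2-dimensional vectors, for illustration only.
center = [0.5, -0.2]
context = [0.4, 0.1]
negatives = [[-0.3, 0.2], [0.1, -0.6]]
print(sgns_loss(center, context, negatives))
```

Each term is the log-loss of a logistic model on the inner product of a word vector and a context vector, which is the sense in which the objective resembles logistic regression on a factorized matrix.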