hyper-parametrisation
k-NN Embedding Stability for word2vec Hyper-Parametrisation in Scientific Text
Word embeddings are increasingly attracting the attention of researchers dealing with semantic similarity and analogy tasks. However, finding the optimal hyper-parameters remains an important challenge due to the resulting impact on the revealed analogies mainly for domain-specific corpora. While analogies are highly used for hypotheses synthesis, it is crucial to optimise word embedding hyper-parameters for precise hypothesis synthesis. Therefore, we propose, in this paper, a methodological approach for tuning word embedding hyper-parameters by using the stability of k-nearest neighbors of word vectors within scientific corpora and more specifically Computer Science corpora with Machine learning adopted as a case study. This approach is tested on a dataset created from NIPS (Conference on Neural Information Processing Systems) publications, and evaluated with a curated ACM hierarchy and Wikipedia Machine Learning outline as the gold standard. Our quantitative and qualitative analysis indicate that our approach not only reliably captures interesting patterns like "unsupervised_learning is to kmeans as supervised_learning is to knn", but also captures the analogical hierarchy structure of Machine Learning and consistently outperforms the \(61\%\) sate-of-the-art embeddings on syntactic accuracy with \(68\%\).