Extrofitting: Enriching Word Representation and its Vector Space with Semantic Lexicons

arXiv.org Artificial Intelligence

We propose post-processing method for enriching not only word representation but also its vector space using semantic lexicons, which we call extrofitting. The method consists of 3 steps as follows: (i) Expanding 1 or more dimension(s) on all the word vectors, filling with their representative value. (ii) Transferring semantic knowledge by averaging each representative values of synonyms and filling them in the expanded dimension(s). These two steps make representations of the synonyms close together. (iii) Projecting the vector space using Linear Discriminant Analysis, which eliminates the expanded dimension(s) with semantic knowledge. When experimenting with GloVe, we find that our method outperforms Faruqui's retrofitting on some of word similarity task. We also report further analysis on our method in respect to word vector dimensions, vocabulary size as well as other well-known pretrained word vectors (e.g., Word2Vec, Fasttext).

Information retrieval document search using vector space model in R


Note, there are many variations in the way we calculate the term-frequency(tf) and inverse document frequency (idf), in this post we have seen one variation. Below images show as the other recommended variations of tf and idf, taken from wiki.

Sentence Similarity in Python using Doc2Vec – Kanoki


Numeric representation of Text documents is challenging task in machine learning and there are different ways there to create the numerical features for texts such as vector representation using Bag of Words, Tf-IDF etc.I am not going in detail what are the advantages of one over the other or which is the best one to use in which case. There are lot of good reads available to explain this. It's a Model to create the word embeddings, where it takes input as a large corpus of text and produces a vector space typically of several hundred dimesions. The underlying assumption of Word2Vec is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model. For instance: "Bank", "money" and "accounts" are often used in similar situations, with similar surrounding words like "dollar", "loan" or "credit", and according to Word2Vec they will therefore share a similar vector representation.

A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System

AAAI Conferences

This paper describes a vector space equalization scheme for a concept-based collaborative information retrieval system; evaluation results are given. The authors previously proposed a peer-to-peer information exchange system that aims at smooth knowledge and information management to activate organizations and communities. One problem with the system arises when information is retrieved from another's personal repository since the framework's retrieval criteria are strongly personalized. The system is assumed to employ a vector space model and a concept-base as its information retrieval mechanism. The vector space of one system is very different from that of another system, so retrieval results would not reflect the requester's intention. This paper presents a vector space equalization scheme, the automated relevance feedback scheme, that compensates the differences in the vector spaces of the personal repositories. A system that implements the scheme is realized and evaluated using documents on the Internet. This paper presents implementation details, the evaluation procedure, and evaluation results.

A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Neural Information Processing Systems

We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to nontrivial bound values and - for maximum margins - to a vanishing complexity term.Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1 Introduction Linear classifiers are exceedingly popular in the machine learning community due to their straightforward applicability and high flexibility which has recently been boosted by the so-called kernel methods [13]. A natural and popular framework for the theoretical analysis of classifiers is the PAC (probably approximately correct) framework[11] which is closely related to Vapnik's work on the generalisation error [12]. For binary classifiers it turned out that the growth function is an appropriate measureof "complexity" and can tightly be upper bounded by the VC (Vapnik-Chervonenkis) dimension [14].