Word embedding -- the mapping of words into numerical vector spaces -- has proved to be an incredibly important method for natural language processing (NLP) tasks in recent years, enabling various machine learning models that rely on vector representations as input to enjoy richer representations of text. These representations preserve more semantic and syntactic information about words, leading to improved performance in almost every imaginable NLP task. Both the novel idea itself and its tremendous impact have led researchers to consider how to provide this boon of richer vector representations to larger units of text -- from sentences to books. This effort has resulted in a slew of new methods to produce these mappings, with various innovative solutions to the problem and some notable breakthroughs.

This post presents the different ways practitioners have come up with to produce document embeddings. I use the word document here to refer to any sequence of words, ranging from sentences and paragraphs through social media posts all the way up to articles, books and more complexly structured text documents. I will touch upon not only approaches that are direct extensions of word embedding techniques (e.g., the way doc2vec extends word2vec), but also other notable techniques that produce -- sometimes among other outputs -- a mapping of documents to vectors in ℝⁿ. I will also try to provide links and references to both the original papers and code implementations of the reviewed methods whenever possible.

Note: This topic is somewhat related, but not equivalent, to the problem of learning structured text representations (e.g., Liu & Lapata, 2018). The ability to map documents to informative vector representations has a wide range of applications.
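Before looking at dedicated methods, it helps to have the simplest possible baseline in mind: embedding a document as the average of its words' vectors. A minimal sketch follows; the vocabulary and the 3-dimensional vectors are made-up illustrative values, not outputs of any trained model:

```python
import numpy as np

# Toy word vectors (illustrative values, not from a trained model).
word_vectors = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.9, 0.7]),
}

def embed_document(tokens, vectors):
    """Naive document embedding: the mean of the document's known word vectors."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No in-vocabulary words: fall back to the zero vector.
        return np.zeros(len(next(iter(vectors.values()))))
    return np.mean(known, axis=0)

doc = ["the", "cat", "and", "the", "dog"]  # "the" and "and" are out of vocabulary here
print(embed_document(doc, word_vectors))  # mean of the "cat" and "dog" vectors
```

Averaging discards word order entirely, which is exactly the limitation the methods reviewed in this post try to overcome.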
A few days ago I found out about lda2vec (by Chris Moody) -- a hybrid algorithm combining the best ideas from the well-known LDA (Latent Dirichlet Allocation) topic modeling algorithm and from the somewhat less well-known language modeling tool word2vec. Now I'm going to tell you a tale about lda2vec and my attempts to try it and compare it with a plain LDA implementation (I used the gensim package for this). LDA is able to create document (and topic) representations that are not very flexible but are mostly interpretable to humans. Also, LDA treats a collection of documents as a set of separate documents, whereas word2vec treats a collection of documents as one very long text string. So, lda2vec took the idea of "locality" from word2vec: word2vec is local in the sense that it creates vector representations of words (aka word embeddings) from small text intervals (aka windows).
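The "locality" mentioned above can be made concrete: word2vec learns each word from the few words surrounding it in a sliding window over the token stream. A minimal sketch of that window extraction (the corpus and window size are illustrative; this is only the data-preparation step, not the training itself):

```python
def context_windows(tokens, window=2):
    """Yield (center_word, context_words) pairs, as word2vec's skip-gram sees them."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before the center
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after the center
        yield center, left + right

tokens = "topic models uncover hidden structure".split()
for center, context in context_windows(tokens, window=1):
    print(center, "->", context)
```

Because the window never looks past a few neighboring words, word2vec captures local co-occurrence patterns, while LDA captures document-level word statistics -- the two views lda2vec tries to combine.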
Artificial Intelligence is a topic that has been getting a lot of attention, mostly because of the rapid improvement the field has undergone. Today's amazing innovations are laying the foundations for achievements such as medical research and even flying cars. Back in the 1950s, the fathers of the field, Minsky and McCarthy, described artificial intelligence as any task performed by a program or a machine that, if a human carried out the same activity, we would say the human had to apply intelligence to accomplish. As you can see, this is a fairly broad description, so nowadays everything associated with human intelligence -- planning, learning, reasoning, problem-solving, knowledge representation, perception, motion and manipulation, and, to a lesser extent, social intelligence and creativity -- is described as AI. Now that we know what AI actually means, let's find out what it is used for today!
The design and implementation of fuzzy intelligent systems are studied. These systems operate under conditions of uncertainty, imprecision and ambiguity. Their behavior is described by Fuzzy Algorithms (FA) in the form of fuzzy production rules. A new approach to the design of such fuzzy systems is suggested. The approach is based on the following: (a) Fuzzy Petri Nets (FPN) as a new model and tool for the formal representation and modeling of the given fuzzy systems; (b) the isomorphism between two representations of a fuzzy algorithm (in terms of its Fuzzy Petri Net and its Fuzzy Finite Automaton) as a basis for their hardware realization.
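The abstract above does not spell out the semantics of fuzzy production rules, so the following is only a toy sketch of how such a rule is commonly evaluated (min for AND, max for aggregating competing rules -- a widespread convention, not necessarily the one used in the paper's FPN formalism):

```python
def fire_rule(antecedent_degrees, certainty=1.0):
    """Truth degree of one fuzzy production rule:
    AND of the antecedents (min), scaled by the rule's certainty factor."""
    return min(antecedent_degrees) * certainty

# Degrees of truth of fuzzy propositions (illustrative values).
temperature_high = 0.8
pressure_low = 0.6

# Two rules concluding "open valve"; competing conclusions are aggregated with max.
open_valve = max(
    fire_rule([temperature_high, pressure_low], certainty=0.9),  # rule 1
    fire_rule([0.3], certainty=1.0),                             # rule 2 (weaker, illustrative)
)
print(open_valve)  # ≈ 0.54
```

In a Fuzzy Petri Net, each such rule corresponds to a transition whose firing propagates these truth degrees between places.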
Topic modelling, in the context of Natural Language Processing, is described as a method of uncovering hidden structure in a collection of texts. Although that is indeed true, it is also a pretty useless definition, so let's define topic modelling in more practical terms. There are several scenarios in which topic modelling can prove useful, and several algorithms for doing it.