
Word2vec vs BERT


Both word2vec and BERT are popular recent methods in NLP for generating vector representations of words, essentially replacing word-index dictionaries and one-hot encoded vectors as ways to represent text. Neither word indices nor one-hot encodings capture the semantic sense of language, and one-hot encoding becomes computationally infeasible when the vocabulary is large. Word2vec [1] is a neural network approach that learns distributed word vectors such that words used in similar syntactic or semantic contexts lie close to each other in the distributed vector space.
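A quick sketch of why one-hot encodings carry no semantic information: every pair of distinct one-hot vectors is orthogonal, so their cosine similarity is always zero regardless of meaning, and each vector's length equals the vocabulary size. (The three-word vocabulary here is just for illustration.)

```python
import math

def one_hot(index, vocab_size):
    """One-hot vector: all zeros except a 1 at the word's index."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vocab = {"king": 0, "queen": 1, "apple": 2}
king, queen, apple = (one_hot(vocab[w], len(vocab)) for w in ("king", "queen", "apple"))

# Distinct one-hot vectors are orthogonal, so similarity is always 0:
# "king" is no closer to "queen" than it is to "apple".
print(cosine(king, queen))  # 0.0
print(cosine(king, apple))  # 0.0
```

Distributed representations like word2vec replace these sparse orthogonal vectors with dense low-dimensional ones whose geometry reflects usage context.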

DiaLex: A Benchmark for Evaluating Multidialectal Arabic Word Embeddings Artificial Intelligence

Word embeddings are a core component of modern natural language processing systems, making the ability to thoroughly evaluate them a vital task. We describe DiaLex, a benchmark for intrinsic evaluation of dialectal Arabic word embeddings. DiaLex covers five important Arabic dialects: Algerian, Egyptian, Lebanese, Syrian, and Tunisian. Across these dialects, DiaLex provides a testbank for six syntactic and semantic relations, namely male to female, singular to dual, singular to plural, antonym, comparative, and genitive to past tense. DiaLex thus consists of a collection of word pairs representing each of the six relations in each of the five dialects. To demonstrate the utility of DiaLex, we use it to evaluate a set of existing and new Arabic word embeddings that we developed. Our benchmark, evaluation code, and new word embedding models will be publicly available.
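Intrinsic evaluation over relation word pairs of this kind is commonly run as an analogy test (the 3CosAdd method): given a pair (a, b) exemplifying a relation and a query word c, the embedding is scored on whether the nearest vector to b - a + c is the expected answer. A minimal sketch with hand-picked 2-d toy vectors (a real evaluation would load trained embeddings for each dialect, and the English male-to-female pair below merely stands in for a DiaLex test item):

```python
import math

# Toy 2-d vectors chosen by hand so the analogy holds exactly.
emb = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [2.0, 0.0],
    "queen": [2.0, 1.0],
    "apple": [0.0, 3.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(a, b, c):
    """3CosAdd: return the word d maximizing cos(d, b - a + c), excluding the inputs."""
    target = [emb[b][i] - emb[a][i] + emb[c][i] for i in range(2)]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# One male-to-female test item, in the spirit of DiaLex's relation testbanks:
print(solve_analogy("man", "woman", "king"))  # queen
```

Benchmark accuracy is then the fraction of word pairs for which the predicted word matches the expected one.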

A Review of Different Word Embeddings for Sentiment Classification using Deep Learning Machine Learning

The web is loaded with textual content, and Natural Language Processing is one of the most important fields in Machine Learning. But when data is huge, simple Machine Learning algorithms cannot handle it, and that is when Deep Learning, which is based on Neural Networks, comes into play. However, since neural networks cannot process raw text, we have to convert it through various strategies of word embedding. This paper demonstrates these different word embedding strategies implemented on an Amazon Review Dataset, which has two sentiments to be classified, Happy and Unhappy, based on numerous customer reviews. Moreover, we demonstrate the differences in accuracy, with a discussion of which word embedding to apply when.

Twitter Sentiment on Affordable Care Act using Score Embedding Machine Learning

Mohsen Farhadloo, PhD, John Molson School of Business, Concordia University. August 21, 2019.

Abstract: In this paper we introduce score embedding, a neural-network-based model to learn interpretable vector representations for words. Score embedding is a supervised method that takes advantage of labeled training data and the neural network architecture to learn interpretable representations for words. Health care has been a controversial issue between political parties in the United States. In this paper we use discussions on Twitter regarding different issues of the Affordable Care Act to identify public opinion about the existing health care plans using the proposed score embedding. Our results indicate that our approach effectively incorporates sentiment information and outperforms, or is at least comparable to, state-of-the-art methods, and that negative sentiment towards "TrumpCare" was consistently greater than neutral and positive sentiment over time.

1 Introduction. Sentiment analysis, as a type of text categorization, is the task of identifying the sentiment orientation of documents written in natural language; it assigns one of several predefined sentiment categories to a whole document or to pieces of the document such as phrases or sentences [23, 8]. Many studies used binary classification and reported high performance [18, 29, 24], while some have observed that categorization performance degrades as the number of sentiment categories increases [2, 16, 3, 11]. Bag-Of-Words (BOW), a standard approach for text categorization, represents a document by a vector that indicates the words that appear in the document.
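A minimal bag-of-words sketch, using stdlib Python only (the two example sentences are invented for illustration): each document becomes a count vector over a shared vocabulary, which discards word order and, notably, sentiment polarity cues.

```python
from collections import Counter

docs = [
    "the plan helps many people",
    "the plan hurts many people",
]

# Shared vocabulary: every word that occurs in any document, sorted.
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc):
    """Count vector over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

vectors = [bow_vector(d) for d in docs]
# The two documents differ in only one coordinate ("helps" vs "hurts"),
# even though their sentiments are opposite -- the kind of limitation a
# supervised, label-aware representation such as score embedding targets.
print(vocab)
print(vectors)
```

In BOW, nothing links that one differing coordinate to the opposite labels; a supervised embedding can learn exactly that association from the training data.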

Introduction to Word Embedding


Humans have always excelled at understanding language. It is easy for us to understand the relationships between words, but for computers this task is not so simple. For example, we humans understand that word pairs like king and queen, man and woman, or tiger and tigress share a certain type of relation, but how can a computer figure this out? Word embeddings are a form of word representation that bridges the human understanding of language to that of a machine. They are learned representations of text in an n-dimensional space in which words with similar meanings have similar representations.
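The idea that "similar meaning means similar representation" can be made concrete with cosine similarity: in a good embedding space, the nearest neighbor of a word is a semantically related word. A toy sketch with hand-picked 3-d vectors standing in for learned n-dimensional embeddings (real vectors would come from a trained model):

```python
import math

# Hand-picked toy vectors: royalty-related words cluster together,
# as do the big-cat words.
emb = {
    "king":    [0.90, 0.80, 0.10],
    "queen":   [0.85, 0.90, 0.15],
    "tiger":   [0.10, 0.20, 0.95],
    "tigress": [0.15, 0.25, 0.90],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(word):
    """Most similar other word by cosine similarity."""
    return max((w for w in emb if w != word), key=lambda w: cosine(emb[word], emb[w]))

print(nearest("king"))   # queen
print(nearest("tiger"))  # tigress
```

This is exactly the relationship structure the examples above describe: the computer recovers "queen" for "king" purely from vector geometry, with no dictionary of word relations.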