cbow
Detecting Turkish Synonyms Used in Different Time Periods
Yazar, Umur Togay, Kutlu, Mucahid
The dynamic structure of languages poses significant challenges for applying natural language processing models to historical texts, causing decreased performance in various downstream tasks. Turkish is a prominent example of rapid linguistic transformation due to the language reform in the 20th century. In this paper, we propose two methods for detecting synonyms used in different time periods, focusing on Turkish. In our first method, we use the Orthogonal Procrustes method to align the embedding spaces created from documents written in the corresponding time periods. In our second method, we extend the first by incorporating Spearman's correlation between word frequencies across the years. In our experiments, we show that our proposed methods outperform the baseline method. Furthermore, we observe that the efficacy of our methods remains consistent when the target time period shifts from the 1960s to the 1980s. However, their performance slightly decreases for subsequent time periods.
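The first method above relies on the Orthogonal Procrustes solution for aligning two embedding spaces. As a rough sketch of the underlying closed-form computation (toy data and function names of my own choosing, not the authors' code):

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Find the orthogonal matrix W minimizing ||X @ W - Y||_F.

    X, Y: (n_words, dim) embeddings of the same anchor words in two
    time periods. Closed form: with U, S, Vt = svd(X.T @ Y), the
    minimizer is W = U @ Vt.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Sanity check: if Y is an exact rotation of X, Procrustes recovers it.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
R, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
Y = X @ R
W = orthogonal_procrustes(X, Y)
print(np.allclose(W, R))  # expected: True
```

After alignment, a word's vector from one period can be compared directly (e.g., by cosine similarity) with vectors from the other period to propose cross-period synonyms.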
Incremental and Data-Efficient Concept Formation to Support Masked Word Prediction
Lian, Xin, Baglodi, Nishant, MacLellan, Christopher J.
This paper introduces Cobweb4L, a novel approach for efficient language model learning that supports masked word prediction. The approach builds on Cobweb, an incremental system that learns a hierarchy of probabilistic concepts. Each concept stores the frequencies of words that appear in instances tagged with that concept label. The system utilizes an attribute value representation to encode words and their surrounding context into instances. Cobweb4L uses the information theoretic variant of category utility and a new performance mechanism that leverages multiple concepts to generate predictions. We demonstrate that with these extensions it significantly outperforms prior Cobweb performance mechanisms that use only a single node to generate predictions. Further, we demonstrate that Cobweb4L learns rapidly and achieves performance comparable to and even superior to Word2Vec. Next, we show that Cobweb4L and Word2Vec outperform BERT in the same task with less training data. Finally, we discuss future work to make our conclusions more robust and inclusive.
An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks
Ali, Wazir, Tumrani, Saifullah, Kumar, Jay, Soomro, Tariq Rahim
In this paper, we present a new corpus for training Sindhi word embeddings, consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline to filter unwanted text from the crawled data. The cleaned corpus is then fed to the state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms. To evaluate the pretrained embeddings, we use popular intrinsic and extrinsic evaluation approaches. The evaluation results reveal that continuous-bag-of-words and skip-gram perform better than GloVe and the existing Sindhi fastText word embeddings on both intrinsic and extrinsic evaluations.
Bidirectional Attention as a Mixture of Continuous Word Experts
Wibisono, Kevin Christian, Wang, Yixin
Bidirectional attention – composed of self-attention with positional encodings and the masked language model (MLM) objective – has emerged as a key component of modern large language models (LLMs). Despite its empirical success, few studies have examined its statistical underpinnings: What statistical model is bidirectional attention implicitly fitting? What sets it apart from its non-attention predecessors? We explore these questions in this paper. The key observation is that fitting a single-layer single-head bidirectional attention, upon reparameterization, is equivalent to fitting a continuous bag of words (CBOW) model with mixture-of-experts (MoE) weights. Further, bidirectional attention with multiple heads and multiple layers is equivalent to stacked MoEs and a mixture of MoEs, respectively. This statistical viewpoint reveals the distinct use of MoE in bidirectional attention, which aligns with its practical effectiveness in handling heterogeneous data. It also suggests an immediate extension to categorical tabular data, if we view each word location in a sentence as a tabular feature. Across empirical studies, we find that this extension outperforms existing tabular extensions of transformers in out-of-distribution (OOD) generalization. Finally, this statistical perspective of bidirectional attention enables us to theoretically characterize when linear word analogies are present in its word embeddings. These analyses show that bidirectional attention can require much stronger assumptions to exhibit linear word analogies than its non-attention predecessors.
Realised Volatility Forecasting: Machine Learning via Financial Word Embedding
Rahimikia, Eghbal, Zohren, Stefan, Poon, Ser-Huang
This study develops FinText, a financial word embedding compiled from 15 years of business news archives. The results show that FinText produces substantially more accurate results than general word embeddings based on the gold-standard financial benchmark we introduced. In contrast to well-known econometric models, and over the sample period from 27 July 2007 to 27 January 2022 for 23 NASDAQ stocks, using stock-related news, our simple natural language processing model supported by different word embeddings improves realised volatility forecasts on high volatility days. This improvement in realised volatility forecasting performance switches to normal volatility days when general hot news is used. By utilising SHAP, an Explainable AI method, we also identify and classify key phrases in stock-related and general hot news that moved volatility.
Word2Vec
Word2Vec is a two-layer neural network, with continuous bag-of-words (CBOW) and skip-gram architectures, that captures semantic information. It generates word embeddings (mappings of words into a vector space) for a given text corpus. It converts words into vectors, and these vectors support operations such as addition, subtraction, and distance calculation that preserve the relationships among words. How are these relationships among words formed? Word2Vec assigns similar vector representations to similar words.
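The add/subtract behavior described above can be illustrated with toy vectors. The embedding values below are invented for illustration only; real Word2Vec vectors are learned from a corpus:

```python
import math

# Hypothetical 3-dimensional embeddings (values made up for illustration).
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# The classic analogy: "king" - "man" + "woman" should land near "queen".
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
nearest = max((w for w in emb if w != "king"),
              key=lambda w: cosine(target, emb[w]))
print(nearest)  # queen
```

With trained embeddings the same arithmetic is performed in hundreds of dimensions, but the principle is identical: relationships among words are encoded as directions in the vector space.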
Theory Behind the Basics of NLP - Analytics Vidhya
This article was published as a part of the Data Science Blogathon. Natural Language Processing (NLP) can help you understand the sentiment of any text. This is helpful for understanding the emotions behind, and the type of, the text being examined: negative and positive comments can be easily differentiated. NLP aims to make machines understand text or comments the same way humans can.
How to Use Arabic Word2Vec Word Embedding with LSTM
Word embedding is the approach of learning words and their relative meanings from a corpus of text and representing each word as a dense vector. The word vector is the projection of the word into a continuous feature vector space; see Figure 1 (A) for clarity. Words that have similar meanings should be close together in the vector space, as illustrated in Figure 1 (B). Word2vec is one of the most popular word embedding methods in NLP. Word2vec has two variants, the Continuous Bag-of-Words Model (CBOW) and the Continuous Skip-gram Model [3]; the model architectures are shown in Figure 2. CBOW predicts a word according to its given context, whereas Skip-gram predicts the context according to a given word, which increases the computational complexity [3].
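The difference in prediction direction between the two variants can be seen in how each slices a sentence into training pairs. This is a minimal sketch of pair generation with a fixed context window, not Word2vec's actual implementation (which also uses subsampling and dynamic windows):

```python
def training_pairs(tokens, window=2):
    """Return (CBOW pairs, skip-gram pairs) for a tokenized sentence.

    CBOW: (context words) -> center word, one pair per position.
    Skip-gram: center word -> each context word, so one position yields
    up to 2*window pairs, which raises the per-sentence training cost.
    """
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((tuple(context), center))
        skipgram.extend((center, c) for c in context)
    return cbow, skipgram

cbow, sg = training_pairs(["the", "cat", "sat", "on", "mat"])
print(cbow[2])         # (('the', 'cat', 'on', 'mat'), 'sat')
print(len(cbow), len(sg))
```

Note that the skip-gram list is several times longer than the CBOW list for the same sentence, which reflects the extra computational complexity mentioned above.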
Word Embeddings
A word embedding is a representation of a word as a vector, i.e., a sequence of numbers. Often these vectors encode how the word is used in conjunction with other words in a dataset. Both the encoding technique and the dataset used can vary greatly, and the right choice ultimately depends on the use case. Word embeddings have ubiquitous use cases in NLP/ML because they allow computers, or mathematical equations, to reason about words. Computers see words only as sequences of individual characters, which is rarely useful when reasoning about the semantic or syntactic usage of a word in a language.
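The character-sequence limitation above can be made concrete: a one-hot encoding (the simplest vector representation) treats every pair of distinct words as equally unrelated, whereas dense embeddings can place related words close together. The dense values below are invented purely for illustration:

```python
# One-hot vectors: every pair of distinct words has zero similarity,
# so the representation carries no semantic information at all.
vocab = ["cat", "kitten", "car"]
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(dot(one_hot["cat"], one_hot["kitten"]))  # 0.0, same as cat vs. car

# Hypothetical dense embeddings (values invented for illustration):
# now "cat" can be measurably closer to "kitten" than to "car".
dense = {"cat": [0.9, 0.1], "kitten": [0.8, 0.2], "car": [0.1, 0.9]}
print(dot(dense["cat"], dense["kitten"]) > dot(dense["cat"], dense["car"]))  # True
```

Learned embeddings such as Word2Vec or GloVe produce exactly this kind of dense vector, with the geometry induced by word co-occurrence in the training dataset.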
kōan: A Corrected CBOW Implementation
İrsoy, Ozan, Benton, Adrian, Stratos, Karl
It is a common belief in the NLP community that continuous bag-of-words (CBOW) word embeddings tend to underperform skip-gram (SG) embeddings. We find that this belief is founded less on theoretical differences in their training objectives but more on faulty CBOW implementations in standard software libraries such as the official implementation word2vec.c and Gensim. We show that our correct implementation of CBOW yields word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks while being more than three times as fast to train. We release our implementation, kōan, at https://github.com/bloomberg/koan.
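The abstract attributes CBOW's reputation to faulty implementations rather than its objective. Without going into the paper's specifics, one subtlety any correct CBOW implementation must handle is the gradient scaling that follows from averaging: since the hidden layer is the mean of C context vectors, each context vector's gradient carries a 1/C factor. The toy loss below (a dot product standing in for the full softmax) is my own illustrative construction, not the kōan code:

```python
def cbow_loss(context_vecs, center_vec):
    """Toy CBOW step: hidden layer h = (1/C) * sum of context vectors,
    and the loss is -h . center (a stand-in for the real objective)."""
    C, dim = len(context_vecs), len(context_vecs[0])
    h = [sum(v[d] for v in context_vecs) / C for d in range(dim)]
    return -sum(h[d] * center_vec[d] for d in range(dim))

context = [[0.3, -0.1], [0.5, 0.4], [-0.2, 0.1]]
center = [1.0, 2.0]
C = len(context)

# Analytic gradient w.r.t. each context vector: -(1/C) * center.
analytic = [-c / C for c in center]

# Finite-difference check on context[0][0] confirms the 1/C factor.
eps = 1e-6
bumped = [list(v) for v in context]
bumped[0][0] += eps
numeric = (cbow_loss(bumped, center) - cbow_loss(context, center)) / eps
print(abs(numeric - analytic[0]) < 1e-4)  # True
```

An implementation that applied the full (unscaled) gradient to every context vector would effectively multiply the context-side learning rate by C, which is the kind of discrepancy that a corrected implementation must avoid.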