doc2vec
Word Embedding Techniques for Classification of Star Ratings
Abdelmotaleb, Hesham, McNeile, Craig, Wojtys, Malgorzata
Telecom services are at the core of everyday life in today's societies. The availability of numerous online forums and discussion platforms enables telecom providers to improve their services by exploring their customers' views and learning about the common issues the customers face. Natural Language Processing (NLP) tools can be used to process the free text collected. One way of working with such data is to represent text as numerical vectors using one of many word embedding models based on neural networks. This research uses a novel dataset of telecom customers' reviews to perform an extensive study of how different word embedding algorithms affect the text classification process. Several state-of-the-art word embedding techniques are considered, including BERT, Word2Vec, and Doc2Vec, coupled with several classification algorithms. The important issue of feature engineering and dimensionality reduction is addressed, and several PCA-based approaches are explored. Moreover, the energy consumption of the different word embeddings is investigated. The findings show that some word embedding models lead to consistently better text classifiers in terms of precision, recall, and F1-score. In particular, for the more challenging classification tasks, BERT combined with PCA stood out with the highest performance metrics. Moreover, our proposed PCA approach of combining word vectors using the first principal component shows clear advantages in performance over the traditional approach of taking the average.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Telecommunications (0.86)
- Energy (0.68)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
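The abstract's proposed alternative to averaging, combining a document's word vectors via their first principal component, can be sketched as follows. This is a minimal illustration based only on the abstract; the paper's exact construction may differ, and the toy data is invented:

```python
import numpy as np

def combine_mean(word_vecs):
    # traditional baseline: average the word vectors of a document
    return word_vecs.mean(axis=0)

def combine_first_pc(word_vecs):
    # proposed alternative: first principal component of the centred
    # word-vector matrix (via SVD), giving one vector per document
    X = word_vecs - word_vecs.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]                      # direction of maximal variance

rng = np.random.default_rng(1)
doc = rng.normal(size=(12, 50))       # toy data: 12 words, 50-dim embeddings
print(combine_mean(doc).shape, combine_first_pc(doc).shape)  # (50,) (50,)
```

Both combiners return a single fixed-length document vector, so either can feed the downstream classifiers unchanged.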
Mapping Hymns and Organizing Concepts in the Rigveda: Quantitatively Connecting the Vedic Suktas
Bollineni, Venkatesh, Crk, Igor, Gultepe, Eren
Accessing and gaining insight into the Rigveda poses a non-trivial challenge due to its extremely ancient Sanskrit language, poetic structure, and large volume of text. By using NLP techniques, this study identified topics and semantic connections of hymns within the Rigveda that were corroborated by seven well-known groupings of hymns. The 1,028 suktas (hymns) from the modern English translation of the Rigveda by Jamison and Brereton were preprocessed, and sukta-level embeddings were obtained using i) a novel adaptation of LSA, presented herein, ii) SBERT, and iii) Doc2Vec. Following a UMAP dimension reduction of the vectors, the network of suktas was formed using k-nearest neighbours. Community detection of topics in the sukta networks was then performed with the Louvain, Leiden, and label propagation methods, and the statistical significance of the formed topics was determined using an appropriate null distribution. Only the novel adaptation of LSA combined with the Leiden method detected sukta topic networks that were significant (z = 2.726, p < .01), with a modularity score of 0.944. Of the seven famous sukta groupings analyzed (e.g., creation, funeral, water), the LSA-derived network was successful in all seven cases, while Doc2Vec was not significant and failed to detect the relevant suktas. SBERT detected four of the famous groupings as separate groups, but mistakenly combined three of them into a single mixed group; the SBERT network was also not statistically significant.
- Europe > Netherlands > South Holland > Leiden (0.45)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.05)
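The pipeline step of forming a sukta network with k-nearest neighbours after dimension reduction can be sketched as below. This is a hedged illustration with invented toy embeddings; the abstract does not specify the distance metric or k, so cosine similarity and k=5 are assumptions:

```python
import numpy as np

def knn_graph(X, k=5):
    # adjacency matrix of a symmetrised k-nearest-neighbour graph
    # built from cosine similarities between (reduced) embeddings
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-matches
    nn = np.argsort(-sim, axis=1)[:, :k]    # top-k neighbours per node
    A = np.zeros_like(sim, dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nn.ravel()] = True
    return A | A.T                          # make the graph undirected

emb = np.random.default_rng(2).normal(size=(30, 10))  # toy sukta embeddings
A = knn_graph(emb, k=5)
print(A.shape)                              # (30, 30)
```

Community detection (Louvain, Leiden, label propagation) would then be run on this adjacency matrix with a dedicated graph library.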
Implementing LLMs in industrial process modeling: Addressing Categorical Variables
Koronaki, Eleni D., Suntaxi, Geremy Loachamin, Papavasileiou, Paris, Giovanis, Dimitrios G., Kathrein, Martin, Boudouvis, Andreas G., Bordas, Stéphane P. A.
Important variables of processes are, on many occasions, categorical, i.e. names or labels representing, e.g., categories of inputs, types of reactors, or a sequence of steps. In this work, we use Large Language Models (LLMs) to derive embeddings of such inputs that represent their actual meaning, or reflect the "distances" between categories, i.e. how similar or dissimilar they are. This is a marked difference from the current standard practice of using binary, or one-hot, encoding to replace categorical variables with sequences of ones and zeros. Combined with dimensionality reduction techniques, either linear, such as Principal Components Analysis (PCA), or nonlinear, such as Uniform Manifold Approximation and Projection (UMAP), the proposed approach leads to a meaningful, low-dimensional feature space. The significance of obtaining meaningful embeddings is illustrated in the context of an industrial coating process for cutting tools that includes both numerical and categorical inputs. The proposed approach enables feature importance analysis, which is a marked improvement compared to the current state-of-the-art (SotA) in the encoding of categorical variables.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Greece (0.04)
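The contrast the abstract draws between one-hot encoding and meaning-aware embeddings can be illustrated numerically. The category names and embedding values below are invented for illustration; only the geometric point comes from the abstract: one-hot codes place every pair of categories at the same distance, while learned embeddings can place similar categories closer together.

```python
import numpy as np

cats = ["rotary kiln", "batch reactor", "tubular reactor"]  # hypothetical

# one-hot: every pair of distinct categories is equally far apart
onehot = np.eye(len(cats))
d_onehot = np.linalg.norm(onehot[1] - onehot[2])   # sqrt(2) for any pair

# invented embedding values standing in for LLM-derived vectors:
# the two reactor types sit closer together than either does to the kiln
emb = np.array([[0.9, 0.1, 0.0],
                [0.1, 0.8, 0.3],
                [0.2, 0.7, 0.4]])
d_reactors = np.linalg.norm(emb[1] - emb[2])
d_kiln = np.linalg.norm(emb[0] - emb[1])
print(d_onehot, d_reactors, d_kiln)
```

With one-hot codes there is no notion of "more similar" categories, which is exactly the information the LLM-derived embeddings are meant to supply.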
Training Natural Language Processing Models on Encrypted Text for Enhanced Privacy
Tasar, Davut Emre, Tasar, Ceren Ocal
With the increasing use of cloud-based services for training and deploying machine learning models, data privacy has become a major concern. This is particularly important for natural language processing (NLP) models, which often process sensitive information such as personal communications and confidential documents. In this study, we propose a method for training NLP models on encrypted text data to mitigate data privacy concerns while maintaining similar performance to models trained on non-encrypted data. We demonstrate our method using two different architectures, namely Doc2Vec+XGBoost and Doc2Vec+LSTM, and evaluate the models on the 20 Newsgroups dataset. Our results indicate that both encrypted and non-encrypted models achieve comparable performance, suggesting that our encryption method is effective in preserving data privacy without sacrificing model accuracy. In order to replicate our experiments, we have provided a Colab notebook at the following address: https://t.ly/lR-TP
- Asia > Middle East > Republic of Türkiye > Karabuk Province > Karabuk (0.06)
- Asia > Middle East > Republic of Türkiye > İzmir Province > İzmir (0.05)
- Asia > India (0.05)
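The abstract does not specify the encryption scheme, but one way such a setup can work is deterministic per-token hashing, which hides the surface vocabulary while preserving the token identity and co-occurrence statistics that Doc2Vec training relies on. The sketch below is an assumption for illustration, not necessarily the authors' method:

```python
import hashlib

def encrypt_tokens(text, key="secret"):
    # deterministic per-token hashing: the same word always maps to the
    # same opaque token, so embedding models can still learn from
    # co-occurrence, while the vocabulary is no longer human-readable
    return [hashlib.sha256((key + tok).encode()).hexdigest()[:12]
            for tok in text.lower().split()]

print(encrypt_tokens("private message about the contract"))
```

A model trained on such hashed tokens sees the same distributional structure as one trained on plaintext, which is consistent with the comparable performance the abstract reports.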
Discovery and Recognition of Formula Concepts using Machine Learning
Scharpf, Philipp, Schubotz, Moritz, Cohl, Howard S., Breitinger, Corinna, Gipp, Bela
Citation-based Information Retrieval (IR) methods for scientific documents have proven effective for IR applications, such as Plagiarism Detection or Literature Recommender Systems in academic disciplines that use many references. In science, technology, engineering, and mathematics, researchers often employ mathematical concepts through formula notation to refer to prior knowledge. Our long-term goal is to generalize citation-based IR methods and apply this generalized method to both classical references and mathematical concepts. In this paper, we suggest how mathematical formulas could be cited and define a Formula Concept Retrieval task with two subtasks: Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR). While FCD aims at the definition and exploration of a 'Formula Concept' that names bundled equivalent representations of a formula, FCR is designed to match a given formula to a prior assigned unique mathematical concept identifier. We present machine learning-based approaches to address the FCD and FCR tasks. We then evaluate these approaches on a standardized test collection (NTCIR arXiv dataset). Our FCD approach yields a precision of 68% for retrieving equivalent representations of frequent formulas and a recall of 72% for extracting the formula name from the surrounding text. FCD and FCR enable the citation of formulas within mathematical documents and facilitate semantic search and question answering as well as document similarity assessments for plagiarism detection or recommender systems.
- Europe > Germany > Lower Saxony > Gottingen (0.14)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
- North America > United States > Texas > Tarrant County > Fort Worth (0.04)
Unsupervised extraction, labelling and clustering of segments from clinical notes
Zelina, Petr, Halámková, Jana, Nováček, Vít
This work is motivated by the scarcity of tools for accurate, unsupervised information extraction from unstructured clinical notes in computationally underrepresented languages, such as Czech. We introduce a stepping stone to a broad array of downstream tasks such as summarisation or integration of individual patient records, extraction of structured information for national cancer registry reporting or building of semi-structured semantic patient representations for computing patient embeddings. More specifically, we present a method for unsupervised extraction of semantically-labelled textual segments from clinical notes and test it out on a dataset of Czech breast cancer patients, provided by Masaryk Memorial Cancer Institute (the largest Czech hospital specialising in oncology). Our goal was to extract, classify (i.e. label) and cluster segments of the free-text notes that correspond to specific clinical features (e.g., family background, comorbidities or toxicities). The presented results demonstrate the practical relevance of the proposed approach for building more sophisticated extraction and analytical pipelines deployed on Czech clinical notes.
- Europe > Czechia > South Moravian Region > Brno (0.05)
- Europe > Ireland > Connaught > County Galway > Galway (0.04)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Health Care Technology > Medical Record (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Classification of Misinformation in News Articles using Natural Language Processing and a Recurrent Neural Network
Cunha, Brendan, Manikonda, Lydia
Misinformation in news articles has been one of the main topics for discussion over the past few years. There have been several organizations that developed methods for assessing the reliability and personal bias of news coverage. In today's day and age, it is unnatural to arbitrarily trust the news outlets that claim to be truly objective and unbiased, because the term "bias" is relative. What one person perceives as ...

One of the first issues to address with these labels is the inconsistency of the scales used. For example, some labels are scaled from 0-3 in terms of level of misinformation, others are scaled in a binary manner with 0 and 1, and some have 4 categorical values based on levels of media bias. So there is quite a bit of processing that needed to be done to normalize everything and transform the qualitative variables into quantitative variables.
- Asia > Russia (0.14)
- North America > United States > New York > Rensselaer County > Troy (0.04)
- Europe > Russia (0.04)
- Media > News (1.00)
- Government > Regional Government > North America Government > United States Government (0.48)
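The normalization step the text describes, mapping labels on different scales onto a common quantitative variable, could look roughly like this. The specific mapping choices are assumptions for illustration; the authors' exact transformation is not given:

```python
def normalize_label(value, scheme):
    # one possible way (an assumption) to map the differently-scaled
    # labels described in the text onto a common 0-1 score
    if scheme == "0-3":          # graded misinformation level 0..3
        return value / 3.0
    if scheme == "binary":       # already 0 or 1
        return float(value)
    if scheme == "bias-4":       # 4 media-bias categories, assumed ordered 0..3
        return value / 3.0
    raise ValueError(f"unknown labelling scheme: {scheme}")

print(normalize_label(2, "0-3"), normalize_label(1, "binary"))
```

After this step every source contributes a comparable quantitative label, which is what a single downstream classifier needs.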
Two minutes NLP -- Doc2Vec in a nutshell
Doc2Vec is an unsupervised algorithm that learns embeddings from variable-length pieces of text, such as sentences, paragraphs, and documents. It was originally presented in the paper "Distributed Representations of Sentences and Documents". Let's review Word2Vec first, as it provides the inspiration for the Doc2Vec algorithm. Word2Vec learns word vectors by predicting a word in a sentence using the other words in its context. In this framework, every word is mapped to a unique vector, represented by a column in a matrix W. The concatenation or sum of the context vectors is then used as features for predicting the target word. The word vectors are trained using stochastic gradient descent.
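The prediction setup described above, where each word is a column of a matrix W and the summed context vectors feed a classifier trained with stochastic gradient descent, can be sketched in a toy form. This is a simplified CBOW-style illustration on invented data, not gensim's or the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8             # vocabulary size, embedding dimension

W = rng.normal(0, 0.1, (D, V))   # word vectors: one column per word
U = rng.normal(0, 0.1, (V, D))   # output (softmax) weights
lr, window = 0.05, 2

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for epoch in range(200):
    for t, word in enumerate(corpus):
        # context = surrounding words within the window, excluding the target
        ctx = [idx[corpus[j]]
               for j in range(max(0, t - window),
                              min(len(corpus), t + window + 1))
               if j != t]
        h = W[:, ctx].sum(axis=1)        # sum of context word vectors
        p = softmax(U @ h)               # predicted distribution over vocab
        y = np.zeros(V)
        y[idx[word]] = 1.0
        err = p - y                      # cross-entropy gradient wrt logits
        grad_h = U.T @ err
        U -= lr * np.outer(err, h)       # SGD update of output weights
        for c in ctx:                    # SGD update of context word vectors
            W[:, c] -= lr * grad_h

vec = W[:, idx["fox"]]                   # learned embedding for "fox"
print(vec.shape)                         # (8,)
```

Doc2Vec extends this by adding a paragraph vector, a column in a separate matrix that acts as an extra "word" shared by all contexts from the same document.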
Key Phrase Extraction & Applause Prediction
Yadav, Krishna, Choudhary, Lakshya
With the increase in content available over the internet, it is very difficult to get noticed. It has become an utmost priority for blog writers to get feedback on their creations so they can be confident about the impact of their articles. We train a machine learning model to learn popular article styles, in the form of vector-space representations using various word embeddings, and their popularity based on claps and tags.
Reading The Markets -- Machine Learning Versus The Financial News
Suffice it to say that neural networks are a form of non-linear regression tool whose underlying design found inspiration in a simplification of the basic architecture of the human brain. Many of the great advances that we have experienced in Machine Learning over the last few years make use of neural networks. The basic algorithm has been around for decades, but it has come into its own as processing power and data availability have steadily increased. For this project we implemented our neural network in Python using the popular TensorFlow library from Google. The characteristics of our neural network, and in particular its complexity, were chosen to balance precision and generalization.
- North America > United States > New York (0.04)
- Europe > Austria > Vienna (0.04)