doc2vec
Word Embedding Techniques for Classification of Star Ratings
Abdelmotaleb, Hesham, McNeile, Craig, Wojtys, Malgorzata
Telecom services are at the core of everyday life in today's societies. The availability of numerous online forums and discussion platforms enables telecom providers to improve their services by exploring their customers' views and learning about the common issues the customers face. Natural Language Processing (NLP) tools can be used to process the free text collected. One way of working with such data is to represent text as numerical vectors using one of many word embedding models based on neural networks. This research uses a novel dataset of telecom customers' reviews to perform an extensive study of how different word embedding algorithms affect the text classification process. Several state-of-the-art word embedding techniques are considered, including BERT, Word2Vec, and Doc2Vec, coupled with several classification algorithms. The important issue of feature engineering and dimensionality reduction is addressed, and several PCA-based approaches are explored. Moreover, the energy consumption of the different word embeddings is investigated. The findings show that some word embedding models lead to consistently better text classifiers in terms of precision, recall, and F1-score. In particular, for the more challenging classification tasks, BERT combined with PCA stood out with the highest performance metrics. Moreover, our proposed PCA approach of combining word vectors using the first principal component shows clear advantages in performance over the traditional approach of taking the average.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Telecommunications (0.86)
- Energy (0.68)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
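The abstract's proposed alternative to averaging, combining a document's word vectors via their first principal component, can be sketched as follows. This is a minimal illustration based only on the abstract; the paper's exact construction may differ, and the toy data is invented:

```python
import numpy as np

def combine_mean(word_vecs):
    # traditional baseline: average the word vectors of a document
    return word_vecs.mean(axis=0)

def combine_first_pc(word_vecs):
    # proposed alternative: first principal component of the centred
    # word-vector matrix (via SVD), giving one vector per document
    X = word_vecs - word_vecs.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]                      # direction of maximal variance

rng = np.random.default_rng(1)
doc = rng.normal(size=(12, 50))       # toy data: 12 words, 50-dim embeddings
print(combine_mean(doc).shape, combine_first_pc(doc).shape)  # (50,) (50,)
```

Both combiners return a single fixed-length document vector, so either can feed the downstream classifiers unchanged.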
Mapping Hymns and Organizing Concepts in the Rigveda: Quantitatively Connecting the Vedic Suktas
Bollineni, Venkatesh, Crk, Igor, Gultepe, Eren
Accessing and gaining insight into the Rigveda poses a non-trivial challenge due to its extremely ancient Sanskrit language, poetic structure, and large volume of text. By using NLP techniques, this study identified topics and semantic connections of hymns within the Rigveda that were corroborated by seven well-known groupings of hymns. The 1,028 suktas (hymns) from the modern English translation of the Rigveda by Jamison and Brereton were preprocessed, and sukta-level embeddings were obtained using i) a novel adaptation of LSA, presented herein, ii) SBERT, and iii) Doc2Vec. Following a UMAP dimension reduction of the vectors, the network of suktas was formed using k-nearest neighbours. Community detection of topics in the sukta networks was then performed with the Louvain, Leiden, and label propagation methods, and the statistical significance of the formed topics was determined using an appropriate null distribution. Only the novel adaptation of LSA combined with the Leiden method detected sukta topic networks that were significant (z = 2.726, p < .01), with a modularity score of 0.944. Of the seven famous sukta groupings analyzed (e.g., creation, funeral, water), the LSA-derived network was successful in all seven cases, while Doc2Vec was not significant and failed to detect the relevant suktas. SBERT detected four of the famous groupings as separate groups, but mistakenly combined three of them into a single mixed group; the SBERT network was also not statistically significant.
- Europe > Netherlands > South Holland > Leiden (0.45)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.05)
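The pipeline step of forming a sukta network with k-nearest neighbours after dimension reduction can be sketched as below. This is a hedged illustration with invented toy embeddings; the abstract does not specify the distance metric or k, so cosine similarity and k=5 are assumptions:

```python
import numpy as np

def knn_graph(X, k=5):
    # adjacency matrix of a symmetrised k-nearest-neighbour graph
    # built from cosine similarities between (reduced) embeddings
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-matches
    nn = np.argsort(-sim, axis=1)[:, :k]    # top-k neighbours per node
    A = np.zeros_like(sim, dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nn.ravel()] = True
    return A | A.T                          # make the graph undirected

emb = np.random.default_rng(2).normal(size=(30, 10))  # toy sukta embeddings
A = knn_graph(emb, k=5)
print(A.shape)                              # (30, 30)
```

Community detection (Louvain, Leiden, label propagation) would then be run on this adjacency matrix with a dedicated graph library.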
Implementing LLMs in industrial process modeling: Addressing Categorical Variables
Koronaki, Eleni D., Suntaxi, Geremy Loachamin, Papavasileiou, Paris, Giovanis, Dimitrios G., Kathrein, Martin, Boudouvis, Andreas G., Bordas, Stéphane P. A.
Important variables of processes are, on many occasions, categorical, i.e. names or labels representing, e.g., categories of inputs, types of reactors, or a sequence of steps. In this work, we use Large Language Models (LLMs) to derive embeddings of such inputs that represent their actual meaning, or reflect the "distances" between categories, i.e. how similar or dissimilar they are. This is a marked difference from the current standard practice of using binary, or one-hot, encoding to replace categorical variables with sequences of ones and zeros. Combined with dimensionality reduction techniques, either linear, such as Principal Components Analysis (PCA), or nonlinear, such as Uniform Manifold Approximation and Projection (UMAP), the proposed approach leads to a meaningful, low-dimensional feature space. The significance of obtaining meaningful embeddings is illustrated in the context of an industrial coating process for cutting tools that includes both numerical and categorical inputs. The proposed approach enables feature importance analysis, which is a marked improvement compared to the current state-of-the-art (SotA) in the encoding of categorical variables.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Greece (0.04)
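The contrast the abstract draws between one-hot encoding and meaning-aware embeddings can be illustrated numerically. The category names and embedding values below are invented for illustration; only the geometric point comes from the abstract: one-hot codes place every pair of categories at the same distance, while learned embeddings can place similar categories closer together.

```python
import numpy as np

cats = ["rotary kiln", "batch reactor", "tubular reactor"]  # hypothetical

# one-hot: every pair of distinct categories is equally far apart
onehot = np.eye(len(cats))
d_onehot = np.linalg.norm(onehot[1] - onehot[2])   # sqrt(2) for any pair

# invented embedding values standing in for LLM-derived vectors:
# the two reactor types sit closer together than either does to the kiln
emb = np.array([[0.9, 0.1, 0.0],
                [0.1, 0.8, 0.3],
                [0.2, 0.7, 0.4]])
d_reactors = np.linalg.norm(emb[1] - emb[2])
d_kiln = np.linalg.norm(emb[0] - emb[1])
print(d_onehot, d_reactors, d_kiln)
```

With one-hot codes there is no notion of "more similar" categories, which is exactly the information the LLM-derived embeddings are meant to supply.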
Training Natural Language Processing Models on Encrypted Text for Enhanced Privacy
Tasar, Davut Emre, Tasar, Ceren Ocal
With the increasing use of cloud-based services for training and deploying machine learning models, data privacy has become a major concern. This is particularly important for natural language processing (NLP) models, which often process sensitive information such as personal communications and confidential documents. In this study, we propose a method for training NLP models on encrypted text data to mitigate data privacy concerns while maintaining similar performance to models trained on non-encrypted data. We demonstrate our method using two different architectures, namely Doc2Vec+XGBoost and Doc2Vec+LSTM, and evaluate the models on the 20 Newsgroups dataset. Our results indicate that both encrypted and non-encrypted models achieve comparable performance, suggesting that our encryption method is effective in preserving data privacy without sacrificing model accuracy. In order to replicate our experiments, we have provided a Colab notebook at the following address: https://t.ly/lR-TP
- Asia > Middle East > Republic of Türkiye > Karabuk Province > Karabuk (0.06)
- Asia > Middle East > Republic of Türkiye > İzmir Province > İzmir (0.05)
- Asia > India (0.05)
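The abstract does not specify the encryption scheme, but one way such a setup can work is deterministic per-token hashing, which hides the surface vocabulary while preserving the token identity and co-occurrence statistics that Doc2Vec training relies on. The sketch below is an assumption for illustration, not necessarily the authors' method:

```python
import hashlib

def encrypt_tokens(text, key="secret"):
    # deterministic per-token hashing: the same word always maps to the
    # same opaque token, so embedding models can still learn from
    # co-occurrence, while the vocabulary is no longer human-readable
    return [hashlib.sha256((key + tok).encode()).hexdigest()[:12]
            for tok in text.lower().split()]

print(encrypt_tokens("private message about the contract"))
```

A model trained on such hashed tokens sees the same distributional structure as one trained on plaintext, which is consistent with the comparable performance the abstract reports.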
Discovery and Recognition of Formula Concepts using Machine Learning
Scharpf, Philipp, Schubotz, Moritz, Cohl, Howard S., Breitinger, Corinna, Gipp, Bela
Citation-based Information Retrieval (IR) methods for scientific documents have proven effective for IR applications, such as Plagiarism Detection or Literature Recommender Systems in academic disciplines that use many references. In science, technology, engineering, and mathematics, researchers often employ mathematical concepts through formula notation to refer to prior knowledge. Our long-term goal is to generalize citation-based IR methods and apply this generalized method to both classical references and mathematical concepts. In this paper, we suggest how mathematical formulas could be cited and define a Formula Concept Retrieval task with two subtasks: Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR). While FCD aims at the definition and exploration of a 'Formula Concept' that names bundled equivalent representations of a formula, FCR is designed to match a given formula to a prior assigned unique mathematical concept identifier. We present machine learning-based approaches to address the FCD and FCR tasks. We then evaluate these approaches on a standardized test collection (NTCIR arXiv dataset). Our FCD approach yields a precision of 68% for retrieving equivalent representations of frequent formulas and a recall of 72% for extracting the formula name from the surrounding text. FCD and FCR enable the citation of formulas within mathematical documents and facilitate semantic search and question answering as well as document similarity assessments for plagiarism detection or recommender systems.
- Europe > Germany > Lower Saxony > Gottingen (0.14)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
- North America > United States > Texas > Tarrant County > Fort Worth (0.04)
Unsupervised extraction, labelling and clustering of segments from clinical notes
Zelina, Petr, Halámková, Jana, Nováček, Vít
This work is motivated by the scarcity of tools for accurate, unsupervised information extraction from unstructured clinical notes in computationally underrepresented languages, such as Czech. We introduce a stepping stone to a broad array of downstream tasks such as summarisation or integration of individual patient records, extraction of structured information for national cancer registry reporting or building of semi-structured semantic patient representations for computing patient embeddings. More specifically, we present a method for unsupervised extraction of semantically-labelled textual segments from clinical notes and test it out on a dataset of Czech breast cancer patients, provided by Masaryk Memorial Cancer Institute (the largest Czech hospital specialising in oncology). Our goal was to extract, classify (i.e. label) and cluster segments of the free-text notes that correspond to specific clinical features (e.g., family background, comorbidities or toxicities). The presented results demonstrate the practical relevance of the proposed approach for building more sophisticated extraction and analytical pipelines deployed on Czech clinical notes.
- Europe > Czechia > South Moravian Region > Brno (0.05)
- Europe > Ireland > Connaught > County Galway > Galway (0.04)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Health Care Technology > Medical Record (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Classification of Misinformation in News Articles using Natural Language Processing and a Recurrent Neural Network
Cunha, Brendan, Manikonda, Lydia
Misinformation in news articles has been one of the main topics for discussion over the past few years. There have been several organizations that developed methods for assessing the reliability and personal bias of news coverage. In today's day and age, it is unnatural to arbitrarily trust the news outlets that claim to be truly objective and unbiased, because the term "bias" is relative. What one person perceives as ...

One of the first issues to address with these labels is the inconsistency of the scales used. For example, some labels are scaled from 0-3 in terms of level of misinformation, others are scaled in a binary manner with 0 and 1, and some have 4 categorical values based on levels of media bias. So there is quite a bit of processing that needed to be done to normalize everything and transform the qualitative variables into quantitative variables.
- Asia > Russia (0.14)
- North America > United States > New York > Rensselaer County > Troy (0.04)
- Europe > Russia (0.04)
- Media > News (1.00)
- Government > Regional Government > North America Government > United States Government (0.48)
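The normalization step the text describes, mapping labels on different scales onto a common quantitative variable, could look roughly like this. The specific mapping choices are assumptions for illustration; the authors' exact transformation is not given:

```python
def normalize_label(value, scheme):
    # one possible way (an assumption) to map the differently-scaled
    # labels described in the text onto a common 0-1 score
    if scheme == "0-3":          # graded misinformation level 0..3
        return value / 3.0
    if scheme == "binary":       # already 0 or 1
        return float(value)
    if scheme == "bias-4":       # 4 media-bias categories, assumed ordered 0..3
        return value / 3.0
    raise ValueError(f"unknown labelling scheme: {scheme}")

print(normalize_label(2, "0-3"), normalize_label(1, "binary"))
```

After this step every source contributes a comparable quantitative label, which is what a single downstream classifier needs.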
Two minutes NLP -- Doc2Vec in a nutshell
Doc2Vec is an unsupervised algorithm that learns embeddings from variable-length pieces of text, such as sentences, paragraphs, and documents. It was originally presented in the paper "Distributed Representations of Sentences and Documents". Let's review Word2Vec first, as it provides the inspiration for the Doc2Vec algorithm. Word2Vec learns word vectors by predicting a word in a sentence using the other words in its context. In this framework, every word is mapped to a unique vector, represented by a column in a matrix W. The concatenation or sum of the context vectors is then used as features for predicting the target word. The word vectors are trained using stochastic gradient descent.
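The prediction setup described above, where each word is a column of a matrix W and the summed context vectors feed a classifier trained with stochastic gradient descent, can be sketched in a toy form. This is a simplified CBOW-style illustration on invented data, not gensim's or the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8             # vocabulary size, embedding dimension

W = rng.normal(0, 0.1, (D, V))   # word vectors: one column per word
U = rng.normal(0, 0.1, (V, D))   # output (softmax) weights
lr, window = 0.05, 2

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for epoch in range(200):
    for t, word in enumerate(corpus):
        # context = surrounding words within the window, excluding the target
        ctx = [idx[corpus[j]]
               for j in range(max(0, t - window),
                              min(len(corpus), t + window + 1))
               if j != t]
        h = W[:, ctx].sum(axis=1)        # sum of context word vectors
        p = softmax(U @ h)               # predicted distribution over vocab
        y = np.zeros(V)
        y[idx[word]] = 1.0
        err = p - y                      # cross-entropy gradient wrt logits
        grad_h = U.T @ err
        U -= lr * np.outer(err, h)       # SGD update of output weights
        for c in ctx:                    # SGD update of context word vectors
            W[:, c] -= lr * grad_h

vec = W[:, idx["fox"]]                   # learned embedding for "fox"
print(vec.shape)                         # (8,)
```

Doc2Vec extends this by adding a paragraph vector, a column in a separate matrix that acts as an extra "word" shared by all contexts from the same document.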
Key Phrase Extraction & Applause Prediction
Yadav, Krishna, Choudhary, Lakshya
With the increase in content available over the internet, it is very difficult to get noticed. It has become an utmost priority for blog writers to get feedback on their creations so they can be confident about the impact of their articles. We train a machine learning model to learn popular article styles, in the form of vector-space representations using various word embeddings, and their popularity based on claps and tags.
Reading The Markets -- Machine Learning Versus The Financial News
Suffice it to say that neural networks are a form of non-linear regression tool whose underlying design found inspiration in a simplification of the basic architecture of the human brain. Many of the great advances that we have experienced in Machine Learning over the last few years make use of neural networks. The basic algorithm has been around for decades, but it has come into its own as processing power and data availability have steadily increased. For this project we implemented our neural network in Python using the popular TensorFlow library from Google. The characteristics of our neural network, and in particular its complexity, were chosen to balance precision and generalization.
- North America > United States > New York (0.04)
- Europe > Austria > Vienna (0.04)