word2vec



Irony Detection in Urdu Text: A Comparative Study Using Machine Learning Models and Large Language Models

Ahmad, Fiaz, Hussain, Nisar, Qasim, Amna, Hafeez, Momina, Usman, Muhammad, Sidorov, Grigori, Gelbukh, Alexander

arXiv.org Artificial Intelligence

Irony identification is a challenging task in Natural Language Processing, particularly when dealing with languages that differ in syntax and cultural context. In this work, we aim to detect irony in Urdu by translating an English Ironic Corpus into the Urdu language. We evaluate ten state-of-the-art machine learning algorithms using GloVe and Word2Vec embeddings, and compare their performance with classical methods. Additionally, we fine-tune advanced transformer-based models, including BERT, RoBERTa, LLaMA 2 (7B), LLaMA 3 (8B), and Mistral, to assess the effectiveness of large-scale models in irony detection. Among machine learning models, Gradient Boosting achieved the best performance with an F1-score of 89.18%. Among transformer-based models, LLaMA 3 (8B) achieved the highest performance with an F1-score of 94.61%. These results demonstrate that combining transliteration techniques with modern NLP models enables robust irony detection in Urdu, a historically low-resource language.
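The embedding-plus-classifier pipeline the abstract describes (average pretrained word vectors per sentence, then train Gradient Boosting) can be sketched as follows. This is a minimal sketch assuming scikit-learn and NumPy; the embedding table, vocabulary, and toy corpus are placeholders, not the paper's Urdu data or its actual GloVe/Word2Vec vectors:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder embedding table: in the paper's setup these would be
# GloVe or Word2Vec vectors trained or loaded for Urdu tokens.
vocab = {"ironic": 0, "sincere": 1, "really": 2, "great": 3, "day": 4}
emb = rng.normal(size=(len(vocab), 50))

def sentence_vector(tokens):
    """Average the word vectors of in-vocabulary tokens."""
    idxs = [vocab[t] for t in tokens if t in vocab]
    return emb[idxs].mean(axis=0) if idxs else np.zeros(emb.shape[1])

# Toy corpus standing in for the translated irony corpus (label 1 = ironic).
sentences = [["really", "great", "day"], ["sincere", "day"],
             ["ironic", "great"], ["ironic", "really"]] * 25
labels = [0, 0, 1, 1] * 25

X = np.stack([sentence_vector(s) for s in sentences])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```

The same feature matrix can be fed to any of the ten classifiers the paper compares; only the estimator class changes.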


Tokenization Strategies for Low-Resource Agglutinative Languages in Word2Vec: Case Study on Turkish and Finnish

Hu, Jinfan Frank

arXiv.org Artificial Intelligence

Tokenization plays a critical role in processing agglutinative languages, where a single word can encode multiple morphemes carrying syntactic and semantic information. This study evaluates the impact of four tokenization strategies, word-level, character-level, character n-gram, and Byte Pair Encoding (BPE), on the quality of static word embeddings generated by Word2Vec for Turkish and Finnish. Using a 10,000-article Wikipedia corpus, we trained models under low-resource conditions and evaluated them on a Named Entity Recognition (NER) task. Despite the theoretical appeal of subword segmentation, word-level tokenization consistently outperformed every alternative tested. These findings suggest that in agglutinative, low-resource contexts, preserving word boundaries via word-level tokenization may yield better embedding performance than more complex statistical segmentation. This has practical implications for developing NLP pipelines for under-resourced languages where annotated data and computing power are limited.


KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis

Awlla, Kozhin muhealddin, Veisi, Hadi, Abdullah, Abdulhady Abas

arXiv.org Artificial Intelligence

This paper enhances the study of sentiment analysis for the Central Kurdish language by integrating Bidirectional Encoder Representations from Transformers (BERT) into Natural Language Processing techniques. Kurdish is a low-resource language with high linguistic diversity and minimal computational resources, which makes sentiment analysis challenging. Earlier work relied on traditional word embedding models such as Word2Vec, but with the emergence of new language models, specifically BERT, there is hope for improvement. BERT's stronger contextual word embeddings help capture the nuanced semantics and contextual intricacies of Kurdish, setting a new benchmark for sentiment analysis in low-resource languages. The pipeline includes collecting and normalizing a large corpus of Kurdish texts, pretraining BERT with a tokenizer tailored to Kurdish, and developing several sentiment analysis models, including a Bidirectional Long Short-Term Memory (BiLSTM) network, a Multi-Layer Perceptron (MLP), and a fine-tuned BERT classifier. The proposed approach covers three classes, positive, negative, and neutral, using BERT sentiment embeddings in four different configurations. The best-performing classifier, the BiLSTM, reaches 74.09% accuracy; the BERT-with-MLP model reaches at most 73.96%, while the fine-tuned BERT model tops both with 75.37% accuracy. Additionally, the fine-tuned BERT model shows a large improvement on the two-class (positive and negative) task, with an accuracy of 86%.
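The classification step the pipeline describes, mapping a pooled sentence embedding to one of three sentiment classes, can be sketched with a plain linear head and softmax. This is a NumPy sketch only; the random vectors stand in for a Kurdish BERT's pooled output, and the weights are untrained placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pooled [CLS] embeddings from a Kurdish BERT (random
# placeholders here); a linear head maps them to the three classes.
hidden, n_classes = 768, 3
W = rng.normal(scale=0.02, size=(hidden, n_classes))
b = np.zeros(n_classes)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_sentiment(cls_embedding):
    """Map one pooled sentence embedding to a class label."""
    probs = softmax(cls_embedding @ W + b)
    return ["negative", "neutral", "positive"][int(probs.argmax())]

emb = rng.normal(size=hidden)  # stands in for BERT's pooled output
print(predict_sentiment(emb))
```

In the paper's actual setups the same head sits either on frozen BERT embeddings (feeding the BiLSTM or MLP) or is trained end-to-end when BERT is fine-tuned.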



Large Language Models for Detection of Life-Threatening Texts

Nguyen, Thanh Thi, Wilson, Campbell, Dalins, Janis

arXiv.org Artificial Intelligence

Detecting life-threatening language is essential for safeguarding individuals in distress, promoting mental health and well-being, and preventing potential harm and loss of life. This paper presents an effective approach to identifying life-threatening texts using large language models (LLMs) and compares it with traditional methods such as bag of words, word embeddings, topic modeling, and Bidirectional Encoder Representations from Transformers. We fine-tune three open-source LLMs, Gemma, Mistral, and Llama-2, using their 7B-parameter variants on datasets constructed under class-balanced, imbalanced, and extremely imbalanced scenarios. Experimental results demonstrate the strong performance of LLMs against traditional methods. More specifically, the Mistral and Llama-2 models are the top performers in both the balanced and imbalanced data scenarios, while Gemma is slightly behind. We employ upsampling to deal with the imbalanced scenarios and show that, while this technique benefits traditional approaches, it has much less impact on LLMs. This study demonstrates the great potential of LLMs for real-world life-threatening language detection problems.
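The upsampling technique mentioned, duplicating minority-class examples until the classes balance, can be sketched in a few lines. A minimal sketch with a toy 9:1 split standing in for the paper's extreme-imbalance scenario:

```python
import random

def upsample(texts, labels, seed=0):
    """Duplicate minority-class examples until all classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for t, y in zip(texts, labels):
        by_class.setdefault(y, []).append(t)
    target = max(len(v) for v in by_class.values())
    out = []
    for y, items in by_class.items():
        # Sample with replacement to pad smaller classes up to the largest.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out += [(t, y) for t in items + extra]
    rng.shuffle(out)
    return [t for t, _ in out], [y for _, y in out]

texts = ["ok"] * 9 + ["help me"]   # extreme 9:1 imbalance
labels = [0] * 9 + [1]
bt, bl = upsample(texts, labels)
print(bl.count(0), bl.count(1))    # 9 9
```

As the paper observes, this mainly helps the traditional feature-based classifiers; fine-tuned LLMs appear less sensitive to the original imbalance.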


Text classification using machine learning methods

Oancea, Bogdan

arXiv.org Artificial Intelligence

In this paper we present the results of an experiment aimed to use machine learning methods to obtain models that can be used for the automatic classification of products. In order to apply automatic classification methods, we transformed the product names from a text representation to numeric vectors, a process called word embedding. We used several embedding methods: Count Vectorization, TF-IDF, Word2Vec, FASTTEXT, and GloVe. Having the product names in a form of numeric vectors, we proceeded with a set of machine learning methods for automatic classification: Logistic Regression, Multinomial Naive Bayes, kNN, Artificial Neural Networks, Support Vector Machines, and Decision trees with several variants. The results show an impressive accuracy of the classification process for Support Vector Machines, Logistic Regression, and Random Forests. Regarding the word embedding methods, the best results were obtained with the FASTTEXT technique.
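The core recipe, vectorize product names and feed them to a classifier, can be sketched with one of the paper's embedding and model pairs (TF-IDF with Logistic Regression). A scikit-learn sketch; the toy product names and category labels are invented stand-ins for the paper's product catalogue:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy product names standing in for the paper's real catalogue.
names = ["red cotton shirt", "blue cotton shirt", "usb charging cable",
         "hdmi video cable", "green wool shirt", "usb data cable"] * 10
labels = ["apparel", "apparel", "electronics",
          "electronics", "apparel", "electronics"] * 10

# TF-IDF turns each name into a sparse numeric vector; the classifier
# then learns category boundaries in that vector space.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(names, labels)
print(clf.predict(["black cotton shirt", "usb audio cable"]))
```

Swapping `TfidfVectorizer` for a count vectorizer, or averaging Word2Vec/fastText/GloVe vectors, and swapping `LogisticRegression` for SVMs, naive Bayes, kNN, neural networks, or tree ensembles reproduces the grid of combinations the experiment compares.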


Adversarial Attacks on AI-Generated Text Detection Models: A Token Probability-Based Approach Using Embeddings

Kadhim, Ahmed K., Jiao, Lei, Shafik, Rishad, Granmo, Ole-Christoffer

arXiv.org Artificial Intelligence

In recent years, text generation tools utilizing Artificial Intelligence (AI) have occasionally been misused across various domains, such as generating student reports or creative writing. This issue prompts plagiarism detection services to enhance their capabilities in identifying AI-generated content. Adversarial attacks are often used to test the robustness of AI-generated text detectors. This work proposes a novel textual adversarial attack on detection models such as Fast-DetectGPT. The method employs embedding models for data perturbation, rewriting AI-generated texts so as to reduce the likelihood that their true origin is detected. Specifically, we employ different embedding techniques, including the Tsetlin Machine (TM), an interpretable machine learning approach, for this purpose. By combining synonyms and embedding similarity vectors, we demonstrate a state-of-the-art reduction in detection scores against Fast-DetectGPT. In particular, on the XSum dataset the detection score decreased from 0.4431 to 0.2744 AUROC, and on the SQuAD dataset it dropped from 0.5068 to 0.3532 AUROC.
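The perturbation idea, replacing words with near-synonyms so that the token probabilities a detector relies on shift, can be sketched in plain Python. The hard-coded synonym table below is a hypothetical placeholder; the paper instead derives candidates from embedding similarity (including Tsetlin Machine representations):

```python
import random

# Hypothetical synonym table; the paper's method ranks candidates by
# embedding similarity rather than using a fixed dictionary.
SYNONYMS = {"utilize": ["use", "employ"],
            "demonstrate": ["show", "reveal"],
            "novel": ["new", "original"]}

def perturb(text, rng=None):
    """Replace each word that has a synonym with one of its alternatives,
    shifting the token probabilities a detector like Fast-DetectGPT scores."""
    rng = rng or random.Random(0)
    out = []
    for w in text.split():
        out.append(rng.choice(SYNONYMS[w]) if w in SYNONYMS else w)
    return " ".join(out)

print(perturb("we demonstrate a novel attack"))
```

A real attack would additionally check that each substitution preserves fluency and meaning, since clumsy swaps are easy for humans (and some detectors) to spot.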


Research on Violent Text Detection System Based on BERT-fasttext Model

Yang, Yongsheng, Wang, Xiaoying

arXiv.org Artificial Intelligence

In today's digital age, the internet has become an indispensable platform for people's lives, work, and information exchange. However, violent text has proliferated in the online environment, bringing many negative effects, so building an effective system for filtering violent text is particularly important, and studying violent text filtering based on the BERT-fastText model is significant. BERT is a pre-trained language model with strong natural language understanding ability that can deeply mine and analyze the semantic information of text; fastText is an efficient text classifier with low complexity and good performance that can quickly provide baseline judgments for text processing. Combining the two in a filtering system makes it possible, on the one hand, to identify violent text accurately and, on the other hand, to cut off such content efficiently, preventing harmful information from spreading freely on the network. Compared with the single BERT model and fastText alone, accuracy improved by 0.7% and 0.8%, respectively. This model helps purify the network environment, maintain the health of online information, and create a positive, civilized, and harmonious communication space for netizens, driving social networking, information dissemination, and related areas in a more benign direction.
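One plausible way to couple a cheap classifier with a heavier one, as the abstract's fastText-plus-BERT pairing suggests, is a cascade: trust the fast score when it is confident and fall back to the expensive model only for borderline cases. The exact coupling in the paper is not specified, so the sketch below is purely illustrative, with keyword heuristics standing in for both models:

```python
def fast_score(text):
    """Stand-in for the fastText stage: a cheap keyword-density heuristic."""
    violent_terms = {"kill", "attack", "hurt"}
    words = text.lower().split()
    return sum(w in violent_terms for w in words) / max(len(words), 1)

def bert_score(text):
    """Stand-in for the BERT stage: a pretend contextual model (hypothetical)."""
    return 0.9 if "kill" in text.lower() else 0.1

def classify(text, low=0.05, high=0.5):
    """Cascade: accept the cheap verdict when confident, else ask the heavy model."""
    s = fast_score(text)
    if s <= low:
        return "benign"
    if s >= high:
        return "violent"
    return "violent" if bert_score(text) > 0.5 else "benign"

print(classify("have a nice day"))  # benign
print(classify("i will kill you"))  # violent
```

The design rationale is the one the abstract gives: fastText supplies fast baseline judgments while BERT's deeper semantic analysis handles the ambiguous middle ground.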


Unveiling Topological Structures in Text: A Comprehensive Survey of Topological Data Analysis Applications in NLP

Uchendu, Adaku, Le, Thai

arXiv.org Artificial Intelligence

The surge of data available on the internet has led to the adoption of various computational methods to analyze and extract valuable insights from this wealth of information. Among these, the field of Machine Learning (ML) has thrived by leveraging data to extract meaningful insights. However, ML techniques face notable challenges when dealing with real-world data, often due to issues of imbalance, noise, insufficient labeling, and high dimensionality. To address these limitations, some researchers advocate for the adoption of Topological Data Analysis (TDA), a statistical approach that discerningly captures the intrinsic shape of data despite noise. Despite its potential, TDA has not gained as much traction within the Natural Language Processing (NLP) domain as it has in structurally distinct areas like computer vision. Nevertheless, a dedicated community of researchers has been exploring the application of TDA in NLP, yielding 87 papers that we comprehensively survey in this paper. Our findings categorize these efforts into theoretical and non-theoretical approaches. Theoretical approaches aim to explain linguistic phenomena from a topological viewpoint, while non-theoretical approaches merge TDA with ML features, utilizing diverse numerical representation techniques. We conclude by exploring the challenges and unresolved questions that persist in this niche field. Resources and a list of papers on this topic can be found at: https://github.com/AdaUchendu/AwesomeTDA4NLP.