fasttext
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- Asia > Middle East > Jordan (0.04)
- (4 more...)
A Comparative Analysis of Static Word Embeddings for Hungarian
This paper presents a comprehensive analysis of various static word embeddings for Hungarian, including traditional models such as Word2Vec and FastText, as well as static embeddings derived from BERT-based models using different extraction methods. We evaluate these embeddings on both intrinsic and extrinsic tasks to provide a holistic view of their performance. For intrinsic evaluation, we employ a word analogy task, which assesses the embeddings' ability to capture semantic and syntactic relationships. Our results indicate that traditional static embeddings, particularly FastText, excel in this task, achieving high accuracy and mean reciprocal rank (MRR) scores. Among the BERT-based models, the X2Static method for extracting static embeddings demonstrates superior performance compared to decontextualized and aggregate methods, approaching the effectiveness of traditional static embeddings. For extrinsic evaluation, we use a bidirectional LSTM model to perform Named Entity Recognition (NER) and Part-of-Speech (POS) tagging tasks. The results reveal that embeddings derived from dynamic models, especially those extracted using the X2Static method, outperform purely static embeddings. Notably, ELMo embeddings achieve the highest accuracy in both NER and POS tagging, underscoring the benefits of contextualized representations even when used in a static form. Our findings highlight the continued relevance of static word embeddings in NLP applications and the potential of advanced extraction methods to enhance the utility of BERT-based models. This research contributes to the understanding of embedding performance in Hungarian and provides valuable insights for future developments in the field. The training scripts, evaluation code, restricted vocabulary, and extracted embeddings will be made publicly available to support further research and reproducibility.
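The intrinsic setup above is standard enough to sketch. Below is a minimal word-analogy evaluator in the vector-offset style, reporting accuracy@1 and MRR with gensim; the embedding file, the topn cutoff, and the single Hungarian quadruple are placeholders, not the paper's restricted vocabulary or test set.

```python
# Minimal analogy evaluation: solve a:b :: c:? by vector offset,
# score accuracy@1 and mean reciprocal rank over a candidate list.
from gensim.models import KeyedVectors

def evaluate_analogies(kv, quadruples, topn=10):
    """Return (accuracy@1, MRR) over (a, b, c, expected) quadruples."""
    hits, rr_sum, total = 0, 0.0, 0
    for a, b, c, expected in quadruples:
        if any(w not in kv for w in (a, b, c, expected)):
            continue  # intrinsic suites usually skip out-of-vocabulary items
        total += 1
        # offset b - a + c; gensim excludes the query words from candidates
        ranked = [w for w, _ in kv.most_similar(positive=[b, c],
                                                negative=[a], topn=topn)]
        if ranked[0] == expected:
            hits += 1
        if expected in ranked:
            rr_sum += 1.0 / (ranked.index(expected) + 1)
    return (hits / total, rr_sum / total) if total else (0.0, 0.0)

kv = KeyedVectors.load_word2vec_format("hu_fasttext.vec")  # hypothetical path
acc, mrr = evaluate_analogies(kv, [("király", "királynő", "férfi", "nő")])
```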
- Europe > Hungary > Csongrád-Csanád County > Szeged (0.04)
- Europe > Hungary > Budapest > Budapest (0.04)
- Europe > Switzerland (0.04)
- (7 more...)
LLM Encoder vs. Decoder: Robust Detection of Chinese AI-Generated Text with LoRA
Jin, Houji, Ashrafi, Negin, Abdollahi, Armin, Liu, Wei, Wang, Jian, Gui, Ganyu, Pishgar, Maryam, Feng, Huanghao
The rapid growth of large language models (LLMs) has heightened the demand for accurate detection of AI-generated text, particularly in languages like Chinese, where subtle linguistic nuances pose significant challenges to current methods. In this study, we conduct a systematic comparison of encoder-based Transformers (Chinese BERT-large and RoBERTa-wwm-ext-large), a decoder-only LLM (Alibaba's Qwen2.5-7B/DeepSeek-R1-Distill-Qwen-7B, fine-tuned via Low-Rank Adaptation (LoRA)), and a FastText baseline using the publicly available dataset from the NLPCC 2025 Chinese AI-Generated Text Detection Task. Encoder models were fine-tuned using a novel prompt-based masked language modeling approach, while Qwen2.5-7B was adapted for classification with an instruction-format input and a lightweight classification head trained via LoRA. Experiments reveal that although encoder models nearly memorize training data, they suffer significant performance degradation under distribution shifts (RoBERTa: 76.3% test accuracy; BERT: 79.3%). FastText demonstrates surprising lexical robustness (83.5% accuracy) yet lacks deeper semantic understanding. In contrast, the LoRA-adapted Qwen2.5-7B achieves 95.94% test accuracy with balanced precision-recall metrics, indicating superior generalization and resilience to dataset-specific artifacts. These findings underscore the efficacy of decoder-based LLMs with parameter-efficient fine-tuning for robust Chinese AI-generated text detection. Future work will explore next-generation Qwen3 models, distilled variants, and ensemble strategies to further enhance cross-domain robustness.
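The LoRA-for-classification recipe described here can be sketched with Hugging Face transformers and peft. The rank, target modules, and instruction template below are illustrative assumptions, not the paper's settings; only the low-rank adapters and the classification head end up trainable.

```python
# Hedged sketch: decoder LLM + lightweight classification head, adapted
# with LoRA so only a small fraction of parameters is trained.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id

lora = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16,
                  lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])  # assumed modules
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # adapters + classifier head only

# Instruction-format input; the template is an assumption, not the paper's.
text = "判断下文是否由AI生成：……"
batch = tokenizer(text, return_tensors="pt", truncation=True)
logits = model(**batch).logits  # shape [1, 2]: human vs. AI-generated
```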
- North America > United States > California (0.14)
- Asia > China (0.04)
ReProCon: Scalable and Resource-Efficient Few-Shot Biomedical Named Entity Recognition
Yoo, Jeongkyun, Riddle, Nela, Hoblitzell, Andrew
Named Entity Recognition (NER) in biomedical domains faces challenges due to data scarcity and imbalanced label distributions, especially with fine-grained entity types. We propose ReProCon, a novel few-shot NER framework that combines multi-prototype modeling, cosine-contrastive learning, and Reptile meta-learning to tackle these issues. By representing each category with multiple prototypes, ReProCon captures semantic variability, such as synonyms and contextual differences, while a cosine-contrastive objective ensures strong interclass separation. Reptile meta-updates enable quick adaptation with little data. Using a lightweight fastText + BiLSTM encoder with far lower memory usage than BERT-based encoders, ReProCon achieves a macro-$F_1$ score close to BERT-based baselines (around 99 percent of BERT performance). The model remains stable with a label budget of 30 percent and drops only 7.8 percent in $F_1$ when expanding from 19 to 50 categories, outperforming baselines such as SpanProto and CONTaiNER, which see 10 to 32 percent degradation on Few-NERD. Ablation studies highlight the importance of multi-prototype modeling and contrastive learning in managing class imbalance. Despite difficulties with label ambiguity, ReProCon demonstrates state-of-the-art performance in resource-limited settings, making it suitable for biomedical applications.
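A minimal sketch of the multi-prototype, cosine-contrastive objective: each class keeps K prototype vectors, a token is scored by its best-matching prototype per class, and a temperature-scaled cross-entropy pulls tokens toward their class while pushing other classes away. Shapes and the temperature are illustrative assumptions, not ReProCon's hyperparameters.

```python
# Multi-prototype cosine-contrastive loss over encoder outputs.
import torch
import torch.nn.functional as F

def multi_prototype_loss(tokens, prototypes, labels, tau=0.1):
    """tokens: [N, D]; prototypes: [C, K, D]; labels: [N] class ids."""
    t = F.normalize(tokens, dim=-1)                 # unit token vectors
    p = F.normalize(prototypes, dim=-1)             # unit prototypes
    sims = torch.einsum("nd,ckd->nck", t, p)        # cosine sims [N, C, K]
    class_scores = sims.max(dim=-1).values          # best prototype per class
    return F.cross_entropy(class_scores / tau, labels)

# Usage: 19 classes, 4 prototypes each, 128-dim BiLSTM outputs (all assumed)
loss = multi_prototype_loss(torch.randn(32, 128),
                            torch.randn(19, 4, 128, requires_grad=True),
                            torch.randint(0, 19, (32,)))
loss.backward()
```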
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > District of Columbia > Washington (0.05)
- North America > United States > Indiana > Marion County > Indianapolis (0.04)
- (7 more...)
KAConvText: Novel Approach to Burmese Sentence Classification using Kolmogorov-Arnold Convolution
Thu, Ye Kyaw, Aung, Thura, Oo, Thazin Myint, Supnithi, Thepchai
This paper presents the first application of Kolmogorov-Arnold Convolution for Text (KAConvText) in sentence classification, addressing three tasks: imbalanced binary hate speech detection, balanced multiclass news classification, and imbalanced multiclass ethnic language identification. We investigate various embedding configurations, comparing random and fastText embeddings in both static and fine-tuned settings, with embedding dimensions of 100 and 300 using CBOW and Skip-gram models. Baselines include standard CNNs and CNNs augmented with a Kolmogorov-Arnold Network (CNN-KAN). In addition, we investigate KAConvText with two classification heads, MLP and KAN, where the KAN head offers enhanced interpretability. Results show that KAConvText-MLP with fine-tuned fastText embeddings achieves the best performance: 91.23% accuracy (F1-score = 0.9109) for hate speech detection, 92.66% accuracy (F1-score = 0.9267) for news classification, and 99.82% accuracy (F1-score = 0.9982) for language identification.
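The abstract does not spell out KAConvText's internals; one plausible reading is a 1-D text convolution whose linear dot product is replaced by sums of learnable univariate functions in the Kolmogorov-Arnold style. The sketch below parameterizes those functions with a small Fourier basis (KAN implementations more commonly use B-splines); the class name, basis choice, and sizes are assumptions, not the paper's design.

```python
# Kolmogorov-Arnold-style 1-D convolution over token embeddings.
import torch
import torch.nn as nn

class KAConv1d(nn.Module):
    """Each output channel sums learnable univariate functions of every
    input value in its window, instead of a linear dot product."""
    def __init__(self, in_ch, out_ch, kernel, n_basis=4):
        super().__init__()
        self.kernel = kernel
        self.freq = nn.Parameter(torch.arange(1, n_basis + 1,
                                              dtype=torch.float))
        self.coef = nn.Parameter(
            torch.randn(out_ch, in_ch * kernel, 2 * n_basis) * 0.1)

    def forward(self, x):                      # x: [B, in_ch, L]
        win = x.unfold(-1, self.kernel, 1)     # [B, in_ch, L', kernel]
        B, C, Lp, K = win.shape
        u = win.permute(0, 2, 1, 3).reshape(B, Lp, C * K)
        # Fourier features per scalar input: sin/cos at learned frequencies
        basis = torch.cat([torch.sin(u.unsqueeze(-1) * self.freq),
                           torch.cos(u.unsqueeze(-1) * self.freq)], -1)
        return torch.einsum("blkf,okf->bol", basis, self.coef)

x = torch.randn(2, 100, 40)            # 2 sentences, 100-dim embeddings
y = KAConv1d(100, 64, kernel=3)(x)     # -> [2, 64, 38]
```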
- Asia > Myanmar > Tanintharyi Region > Dawei (0.05)
- Asia > Thailand (0.04)
- Europe > Russia (0.04)
- (3 more...)
- Research Report > New Finding (0.48)
- Research Report > Promising Solution (0.40)
- Overview > Innovation (0.40)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Memory-Efficient FastText: A Comprehensive Approach Using Double-Array Trie Structures and Mark-Compact Memory Management
FastText has established itself as a fundamental algorithm for learning word representations, demonstrating exceptional capability in handling out-of-vocabulary words through character-level n-gram embeddings. However, its hash-based bucketing mechanism introduces critical limitations for large-scale industrial deployment: hash collisions cause semantic drift, and memory requirements become prohibitive for real-world vocabularies containing millions of terms. This paper presents a comprehensive memory optimization framework that fundamentally reimagines FastText's memory management through the integration of double-array trie (DA-trie) structures and mark-compact garbage collection principles. Our approach leverages the linguistic insight that n-grams sharing common prefixes or suffixes exhibit highly correlated embeddings due to co-occurrence patterns in natural language. By systematically identifying and merging semantically similar embeddings based on structural relationships, we achieve compression ratios of 4:1 to 10:1 while maintaining near-perfect embedding quality. The algorithm consists of four phases: prefix trie construction with embedding mapping, prefix-based similarity compression, suffix-based similarity compression, and mark-compact memory reorganization. Comprehensive experiments on a dataset with a 30-million-term Chinese vocabulary demonstrate memory reduction from over 100GB to approximately 30GB with negligible performance degradation. Our industrial deployment results show significant cost reduction, faster loading times, and improved model reliability through the elimination of hash collision artifacts. Code and experimental implementations are available at: https://github.com/initial-d/me_fasttext
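The prefix-based similarity compression phase can be sketched independently of the DA-trie and mark-compact machinery: bucket n-grams by a shared prefix, then let embeddings within a bucket share one row whenever their cosine similarity clears a threshold. The prefix length and threshold below are illustrative assumptions, not the paper's values.

```python
# Prefix-bucketed merging of similar n-gram embeddings.
import numpy as np

def compress_by_prefix(ngrams, emb, prefix_len=3, threshold=0.95):
    """ngrams: list[str]; emb: [N, D] array.
    Returns (compressed rows, ngram -> row-index remap)."""
    buckets = {}
    for i, g in enumerate(ngrams):
        buckets.setdefault(g[:prefix_len], []).append(i)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    kept_rows, remap = [], {}
    for idxs in buckets.values():
        reps = []  # original indices chosen as bucket representatives
        for i in idxs:
            for r in reps:
                if unit[i] @ unit[r] >= threshold:
                    remap[ngrams[i]] = remap[ngrams[r]]  # share the row
                    break
            else:
                remap[ngrams[i]] = len(kept_rows)
                kept_rows.append(emb[i])
                reps.append(i)
    return np.stack(kept_rows), remap
```

The achieved compression ratio is len(ngrams) divided by the number of kept rows; a symmetric pass over reversed n-grams would play the role of the suffix-based phase.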
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Hardware > Memory (0.91)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Omni TM-AE: A Scalable and Interpretable Embedding Model Using the Full Tsetlin Machine State Space
Kadhim, Ahmed K., Jiao, Lei, Shafik, Rishad, Granmo, Ole-Christoffer
The increasing complexity of large-scale language models has amplified concerns regarding their interpretability and reusability. While traditional embedding models like Word2Vec and GloVe offer scalability, they lack transparency and often behave as black boxes. Conversely, interpretable models such as the Tsetlin Machine (TM) have shown promise in constructing explainable learning systems, though they previously faced limitations in scalability and reusability. In this paper, we introduce Omni Tsetlin Machine AutoEncoder (Omni TM-AE), a novel embedding model that fully exploits the information contained in the TM's state matrix, including literals previously excluded from clause formation. This method enables the construction of reusable, interpretable embeddings through a single training phase. Extensive experiments across semantic similarity, sentiment classification, and document clustering tasks show that Omni TM-AE performs competitively with and often surpasses mainstream embedding models. These results demonstrate that it is possible to balance performance, scalability, and interpretability in modern Natural Language Processing (NLP) systems without resorting to opaque architectures.
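The abstract says the embeddings are read off the TM's full state matrix, including literals that never entered a clause. Assuming a trained TM autoencoder exposes an integer state matrix with one row per clause and one column per literal, one plausible reading is that a token's embedding is simply its normalized column of automaton states; this is an illustrative assumption, not the paper's exact construction.

```python
# Illustrative only: embedding a token from a hypothetical Tsetlin Machine
# state matrix, so literals outside any clause still contribute signal.
import numpy as np

def tm_state_embedding(state_matrix, token_index):
    """state_matrix: [n_clauses, n_literals] automaton states.
    Returns the token's column, min-max normalized."""
    col = state_matrix[:, token_index].astype(float)
    return (col - col.min()) / (col.max() - col.min() + 1e-9)

states = np.random.randint(0, 200, size=(512, 10_000))  # toy state matrix
vec = tm_state_embedding(states, token_index=42)        # 512-dim embedding
```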
- Europe > Norway (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- Europe > United Kingdom > England > Tyne and Wear > Newcastle (0.04)
Poisoned Source Code Detection in Code Models
Ghannoum, Ehab, Ghafari, Mohammad
Deep learning models have gained popularity for conducting various tasks involving source code. However, their black-box nature raises concerns about potential risks. One such risk is a poisoning attack, where an attacker intentionally contaminates the training set with malicious samples to mislead the model's predictions in specific scenarios. To protect source code models from poisoning attacks, we introduce CodeGarrison (CG), a hybrid deep-learning model that relies on code embeddings to identify poisoned code samples. We evaluated CG against the state-of-the-art technique ONION for detecting poisoned samples generated by DAMP, MHM, and ALERT, as well as by a novel poisoning technique named CodeFooler. Results showed that CG significantly outperformed ONION with an accuracy of 93.5%. We also tested CG's robustness against unknown attacks, achieving an average accuracy of 85.6% in identifying poisoned samples across the four attacks mentioned above.
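The abstract does not detail CodeGarrison's architecture; the sketch below only illustrates the embed-then-classify pattern it implies, with gensim's fastText over code tokens standing in for the unspecified code embeddings and a deliberately tiny toy corpus.

```python
# Embed code token sequences, then classify poisoned vs. clean samples.
import numpy as np
from gensim.models import FastText
from sklearn.neural_network import MLPClassifier

def embed(model, tokens):
    """Mean of token vectors; fastText covers unseen tokens via n-grams."""
    return np.mean([model.wv[t] for t in tokens], axis=0)

clean = [["def", "add", "(", "a", ",", "b", ")", ":",
          "return", "a", "+", "b"]]
poisoned = [["def", "add", "(", "a", ",", "b", ")", ":",
             "exec", "(", "payload", ")"]]
samples, labels = clean + poisoned, [0, 1]

ft = FastText(sentences=samples, vector_size=50, window=3,
              min_count=1, epochs=50)
X = np.stack([embed(ft, s) for s in samples])
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, labels)
```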
- North America > Dominican Republic (0.04)
- Europe > Germany (0.04)
- Europe > France (0.04)
Text classification using machine learning methods
In this paper we present the results of an experiment aimed at using machine learning methods to obtain models for the automatic classification of products. To apply automatic classification methods, we transformed the product names from a text representation to numeric vectors, a process called word embedding. We used several embedding methods: Count Vectorization, TF-IDF, Word2Vec, FastText, and GloVe. With the product names represented as numeric vectors, we proceeded with a set of machine learning methods for automatic classification: Logistic Regression, Multinomial Naive Bayes, kNN, Artificial Neural Networks, Support Vector Machines, and Decision Trees with several variants. The results show impressive classification accuracy for Support Vector Machines, Logistic Regression, and Random Forests. Among the word embedding methods, the best results were obtained with FastText.
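One column of this experiment grid is straightforward to reproduce in scikit-learn; the sketch pairs TF-IDF features with a linear SVM, one of the strongest combinations reported. The product names and category labels are placeholders, not the paper's dataset.

```python
# TF-IDF vectorization feeding a linear SVM for product classification.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

names = ["organic whole milk 1l", "cotton t-shirt blue m", "usb-c cable 2m"]
labels = ["groceries", "apparel", "electronics"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(names, labels)
print(clf.predict(["wireless usb mouse"]))  # -> likely 'electronics'
```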
- North America > United States (0.15)
- Europe > Romania (0.15)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.50)