AITopics | tf-idf

Abusive speech on social media poses a persistent and evolving challenge, driven by the continuous emergence of novel slang and obfuscated terms designed to circumvent detection systems. In this work, we present a data efficient strategy for fine tuning BERT on hate speech classification by significantly reducing training set size without compromising performance. Our approach employs a TF IDF-based sample selection mechanism to retain only the most informative 75 percent of examples, thereby minimizing training overhead. To address the limitations of BERT's native vocabulary in capturing evolving hate speech terminology, we augment the tokenizer with domain-specific slang and lexical variants commonly found in abusive contexts. Experimental results on a widely used hate speech dataset demonstrate that our method achieves competitive performance while improving computational efficiency, highlighting its potential for scalable and adaptive abusive content moderation.

machine learning, natural language, training data, (17 more...)

arXiv.org Artificial Intelligence

2512.02141

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Identification of Malicious Posts on the Dark Web Using Supervised Machine Learning

Filho, Sebastião Alves de Jesus, Bernardo, Gustavo Di Giovanni, Gabriel, Paulo Henrique Ribeiro, Zarpelão, Bruno Bogaz, Miani, Rodrigo Sanches

arXiv.org Artificial IntelligenceDec-1-2025

Given the constant growth and increasing sophistication of cyberattacks, cybersecurity can no longer rely solely on traditional defense techniques and tools. Proactive detection of cyber threats has become essential to help security teams identify potential risks and implement effective mitigation measures. Cyber Threat Intelligence (CTI) plays a key role by providing security analysts with evidence-based knowledge about cyber threats. CTI information can be extracted using various techniques and data sources; however, machine learning has proven promising. As for data sources, social networks and online discussion forums are commonly explored. In this study, we apply text mining techniques and machine learning to data collected from Dark Web forums in Brazilian Portuguese to identify malicious posts. Our contributions include the creation of three original datasets, a novel multi-stage labeling process combining indicators of compromise (IoCs), contextual keywords, and manual analysis, and a comprehensive evaluation of text representations and classifiers. To our knowledge, this is the first study to focus specifically on Brazilian Portuguese content in this domain. The best-performing model, using LightGBM and TF-IDF, was able to detect relevant posts with high accuracy. We also applied topic modeling to validate the model's outputs on unlabeled data, confirming its robustness in real-world scenarios.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.23183

Country: South America > Brazil (0.68)

Genre: Research Report > New Finding (0.66)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.88)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.93)
(2 more...)

Add feedback

Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

Aung, Thura, Kyaw, Eaint Kay Khaing, Thu, Ye Kyaw, Oo, Thazin Myint, Supnithi, Thepchai

arXiv.org Artificial IntelligenceNov-27-2025

In low-resource languages like Burmese, classification tasks often fine-tune only the final classification layer, keeping pre-trained encoder weights frozen. While Multi-Layer Perceptrons (MLPs) are commonly used, their fixed non-linearity can limit expressiveness and increase computational cost. This work explores Kolmogorov-Arnold Networks (KANs) as alternative classification heads, evaluating Fourier-based FourierKAN, Spline-based EfficientKAN, and Grid-based FasterKAN-across diverse embeddings including TF-IDF, fastText, and multilingual transformers (mBERT, Distil-mBERT). Experimental results show that KAN-based heads are competitive with or superior to MLPs. EfficientKAN with fastText achieved the highest F1-score (0.928), while FasterKAN offered the best trade-off between speed and accuracy. On transformer embeddings, EfficientKAN matched or slightly outperformed MLPs with mBERT (0.917 F1). These findings highlight KANs as expressive, efficient alternatives to MLPs for low-resource language classification.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2511.21081

Country:

Asia > Myanmar (0.16)
Asia > Thailand (0.14)

Genre: Research Report > New Finding (0.49)

Industry: Government (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.69)

Add feedback

ReviewGraph: A Knowledge Graph Embedding Based Framework for Review Rating Prediction with Sentiment Features

de Vink, A. J. W., Amat-Lefort, Natalia, Han, Lifeng

arXiv.org Artificial IntelligenceNov-18-2025

In the hospitality industry, understanding the factors that drive customer review ratings is critical for improving guest satisfaction and business performance. This work proposes ReviewGraph for Review Rating Prediction (RRP), a novel framework that transforms textual customer reviews into knowledge graphs by extracting (subject, predicate, object) triples and associating sentiment scores. Using graph embeddings (Node2Vec) and sentiment features, the framework predicts review rating scores through machine learning classifiers. We compare ReviewGraph performance with traditional NLP baselines (such as Bag of Words, TF-IDF, and Word2Vec) and large language models (LLMs), evaluating them in the HotelRec dataset. In comparison to the state of the art literature, our proposed model performs similar to their best performing model but with lower computational cost (without ensemble). While ReviewGraph achieves comparable predictive performance to LLMs and outperforms baselines on agreement-based metrics such as Cohen's Kappa, it offers additional advantages in interpretability, visual exploration, and potential integration into Retrieval-Augmented Generation (RAG) systems. This work highlights the potential of graph-based representations for enhancing review analytics and lays the groundwork for future research integrating advanced graph neural networks and fine-tuned LLM-based extraction methods. We will share ReviewGraph output and platform open-sourced on our GitHub page https://github.com/aaronlifenghan/ReviewGraph

large language model, machine learning, oversampling 0, (21 more...)

arXiv.org Artificial Intelligence

2508.13953

Country: Europe (0.68)

Genre: Research Report > New Finding (0.71)

Industry: Consumer Products & Services > Hotels (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

An Efficient Classification Model for Cyber Text

Hossen, Md Sakhawat, Borshon, Md. Zashid Iqbal, Badrudduza, A. S. M.

arXiv.org Artificial IntelligenceNov-6-2025

The uprising of deep learning methodology and practice in recent years has brought about a severe consequence of increasing carbon footprint due to the insatiable demand for computational resources and power. The field of text analytics also experienced a massive transformation in this trend of monopolizing methodology. In this paper, the original TF-IDF algorithm has been modified, and Clement Term Frequency-Inverse Document Frequency (CTF-IDF) has been proposed for data preprocessing. This paper primarily discusses the effectiveness of classical machine learning techniques in text analytics with CTF-IDF and a faster IRLBA algorithm for dimensionality reduction. The introduction of both of these techniques in the conventional text analytics pipeline ensures a more efficient, faster, and less computationally intensive application when compared with deep learning methodology regarding carbon footprint, with minor compromise in accuracy. The experimental results also exhibit a manifold of reduction in time complexity and improvement of model accuracy for the classical machine learning methods discussed further in this paper.

machine learning, natural language, text classification, (20 more...)

arXiv.org Artificial Intelligence

2511.03107

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(3 more...)

Add feedback

Machine Learning for Climate Policy: Understanding Policy Progression in the European Green Deal

West, Patricia, Wan, Michelle WL, Hepburn, Alexander, Simpson, Edwin, Santos-Rodriguez, Raul, Clark, Jeffrey N

arXiv.org Artificial IntelligenceOct-21-2025

Climate change demands effective legislative action to mitigate its impacts. This study explores the application of machine learning (ML) to understand the progression of climate policy from announcement to adoption, focusing on policies within the European Green Deal. We present a dataset of 165 policies, incorporating text and metadata. We aim to predict a policy's progression status, and compare text representation methods, including TF-IDF, BERT, and ClimateBERT. Metadata features are included to evaluate the impact on predictive performance. On text features alone, ClimateBERT outperforms other approaches (RMSE = 0.17, R^2 = 0.29), while BERT achieves superior performance with the addition of metadata features (RMSE = 0.16, R^2 = 0.38). Using methods from explainable AI highlights the influence of factors such as policy wording and metadata including political party and country representation. These findings underscore the potential of ML tools in supporting climate policy analysis and decision-making.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2510.16233

Country: