AITopics | Text Classification

Collaborating Authors

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

News Overviews Instructional Materials AI-Alerts Classics

Information-Theoretic Complementary Prompts for Improved Continual Text Classification

Zhang, Duzhen, Ren, Yong, Li, Chenxing, Yu, Dong, Zhang, Tielin

arXiv.org Artificial IntelligenceMay-28-2025

Continual Text Classification (CTC) aims to continuously classify new text data over time while minimizing catastrophic forgetting of previously acquired knowledge. However, existing methods often focus on task-specific knowledge, overlooking the importance of shared, task-agnostic knowledge. Inspired by the complementary learning systems theory, which posits that humans learn continually through the interaction of two systems -- the hippocampus, responsible for forming distinct representations of specific experiences, and the neocortex, which extracts more general and transferable representations from past experiences -- we introduce Information-Theoretic Complementary Prompts (InfoComp), a novel approach for CTC. InfoComp explicitly learns two distinct prompt spaces: P(rivate)-Prompt and S(hared)-Prompt. These respectively encode task-specific and task-invariant knowledge, enabling models to sequentially learn classification tasks without relying on data replay. To promote more informative prompt learning, InfoComp uses an information-theoretic framework that maximizes mutual information between different parameters (or encoded representations). Within this framework, we design two novel loss functions: (1) to strengthen the accumulation of task-specific knowledge in P-Prompt, effectively mitigating catastrophic forgetting, and (2) to enhance the retention of task-invariant knowledge in S-Prompt, improving forward knowledge transfer. Extensive experiments on diverse CTC benchmarks show that our approach outperforms previous state-of-the-art methods.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2505.20933

Country:

Asia > China (0.46)
Asia > Middle East > UAE (0.46)
North America > United States (0.28)
North America > Canada (0.28)

Genre:

Overview (1.00)
Research Report > Experimental Study (0.93)
Research Report > Promising Solution (0.68)

Industry:

Information Technology > Security & Privacy (0.46)
Health & Medicine > Therapeutic Area > Neurology (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Continuous Multi-Task Pre-training for Malicious URL Detection and Webpage Classification

Li, Yujie, Liu, Yiwei, Li, Peiyue, Jia, Yifan, Wang, Yanbin

arXiv.org Artificial IntelligenceMay-27-2025

Malicious URL detection and webpage classification are critical tasks in cybersecurity and information management. In recent years, extensive research has explored using BERT or similar language models to replace traditional machine learning methods for detecting malicious URLs and classifying webpages. While previous studies show promising results, they often apply existing language models to these tasks without accounting for the inherent differences in domain data (e.g., URLs being loosely structured and semantically sparse compared to text), leaving room for performance improvement. Furthermore, current approaches focus on single tasks and have not been tested in multi-task scenarios. To address these challenges, we propose urlBERT, a pre-trained URL encoder leveraging Transformer to encode foundational knowledge from billions of unlabeled URLs. To achieve it, we propose to use 5 unsupervised pretraining tasks to capture multi-level information of URL lexical, syntax, and semantics, and generate contrastive and adversarial representations. Furthermore, to avoid inter-pre-training competition and interference, we proposed a grouped sequential learning method to ensure effective training across multi-tasks. Finally, we leverage a two-stage fine-tuning approach to improve the training stability and efficiency of the task model. To assess the multitasking potential of urlBERT, we fine-tune the task model in both single-task and multi-task modes. The former creates a classification model for a single task, while the latter builds a classification model capable of handling multiple tasks. We evaluate urlBERT on three downstream tasks: phishing URL detection, advertising URL detection, and webpage classification. The results demonstrate that urlBERT outperforms standard pre-trained models, and its multi-task mode is capable of addressing the real-world demands of multitasking.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2402.11495

Country: Asia > China (0.93)

Genre: Research Report > New Finding (0.87)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Add feedback

Fast Multi-Resolution Transformer Fine-tuning for Extreme Multi-label Text Classification

Neural Information Processing SystemsMay-26-2025, 18:16:34 GMT

Extreme multi-label text classification (XMC) seeks to find relevant labels from an extreme large label collection for a given text input. Many real-world applications can be formulated as XMC problems, such as recommendation systems, document tagging and semantic search. Recently, transformer based XMC methods, such as X-Transformer and LightXML, have shown significant improvement over other XMC methods. Despite leveraging pre-trained transformer models for text representation, the fine-tuning procedure of transformer models on large label space still has lengthy computational time even with powerful GPUs. In this paper, we propose a novel recursive approach, XR-Transformer to accelerate the procedure through recursively fine-tuning transformer models on a series of multi-resolution objectives related to the original XMC objective function.

artificial intelligence, machine learning, natural language, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.67)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.65)

Add feedback

The Ripple Effect: On Unforeseen Complications of Backdoor Attacks

Zhang, Rui, Shen, Yun, Li, Hongwei, Jiang, Wenbo, Chen, Hanxiao, Zhang, Yuan, Xu, Guowen, Zhang, Yang

arXiv.org Artificial IntelligenceMay-20-2025

Recent research highlights concerns about the trustworthiness of third-party Pre-Trained Language Models (PTLMs) due to potential backdoor attacks. These backdoored PTLMs, however, are effective only for specific pre-defined downstream tasks. In reality, these PTLMs can be adapted to many other unrelated downstream tasks. Such adaptation may lead to unforeseen consequences in downstream model outputs, consequently raising user suspicion and compromising attack stealthiness. We refer to this phenomenon as backdoor complications. In this paper, we undertake the first comprehensive quantification of backdoor complications. Through extensive experiments using 4 prominent PTLMs and 16 text classification benchmark datasets, we demonstrate the widespread presence of backdoor complications in downstream models fine-tuned from backdoored PTLMs. The output distribution of triggered samples significantly deviates from that of clean samples. Consequently, we propose a backdoor complication reduction method leveraging multi-task learning to mitigate complications without prior knowledge of downstream tasks. The experimental results demonstrate that our proposed method can effectively reduce complications while maintaining the efficacy and consistency of backdoor attacks. Our code is available at https://github.com/zhangrui4041/Backdoor_Complications.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2505.11586

Country:

Asia > Middle East > Palestine (0.04)
Asia > China (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.45)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

KDH-MLTC: Knowledge Distillation for Healthcare Multi-Label Text Classification

Sakai, Hajar, Lam, Sarah S.

arXiv.org Artificial IntelligenceMay-13-2025

The increasing volume of healthcare textual data requires computationally efficient, yet highly accurate classification approaches able to handle the nuanced and complex nature of medical terminology. This research presents Knowledge Distillation for Healthcare Multi - Label Text Classification (KDH - MLTC), a framework leveraging model compr ession and Large Language Models (LLMs). The proposed approach addresses conventional healthcare Multi - Label Text Classification (MLTC) challenges by integrating knowledge distillation and sequential fine - tuning, subsequently optimized through Particle Swa rm Optimization (PSO) for hyperparameter tuning. KDH - MLTC transfers knowledge from a more complex teacher LLM ( i.e., BERT) to a lighter student LLM ( i.e., DistilBERT) through sequential training adapted to MLTC that preserves the teacher's learned information while significantly reducing computational requirements. As a result, the classification is enabled to be conducted locally, making it suitable for healthcare textual data characterized by sensitivity and, therefore, ensuring HIPAA compliance. The e xpe riments conducted on three medical literature datasets of different sizes, sampled from the Hallmark of Cancer (HoC) dataset, demonstrate that KDH - MLTC achieves superior performance compared to existing approaches, particularly for the largest dataset, reaching an F1 score of 82.70% 0.89%. Additionally, statistical validation and an ablation study ar e carried out, proving the robustness of KDH - MLTC. Furthermore, the PSO - based hyperparameter optimization process allow ed the identification of optimal configurations. The proposed approach contributes to healthcare text classification research, balancing efficiency requirements in resource - constrained healthcare settings with satisfactory accuracy demands.

kdh, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2505.07162

Country: North America > United States (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback

Dynamic Domain Information Modulation Algorithm for Multi-domain Sentiment Analysis

Yue, Chunyi, Li, Ang

arXiv.org Artificial IntelligenceMay-13-2025

Multi-domain sentiment classification aims to mitigate poor performance models due to the scarcity of labeled data in a single domain, by utilizing data labeled from various domains. A series of models that jointly train domain classifiers and sentiment classifiers have demonstrated their advantages, because domain classification helps generate necessary information for sentiment classification. Intuitively, the importance of sentiment classification tasks is the same in all domains for multi-domain sentiment classification; but domain classification tasks are different because the impact of domain information on sentiment classification varies across different fields; this can be controlled through adjustable weights or hyper parameters. However, as the number of domains increases, existing hyperparameter optimization algorithms may face the following challenges: (1) tremendous demand for computing resources, (2) convergence problems, and (3) high algorithm complexity. To efficiently generate the domain information required for sentiment classification in each domain, we propose a dynamic information modulation algorithm. Specifically, the model training process is divided into two stages. In the first stage, a shared hyperparameter, which would control the proportion of domain classification tasks across all fields, is determined. In the second stage, we introduce a novel domain-aware modulation algorithm to adjust the domain information contained in the input text, which is then calculated based on a gradient-based and loss-based method. In summary, experimental results on a public sentiment analysis dataset containing 16 domains prove the superiority of the proposed method.

machine learning, natural language, text classification, (17 more...)

arXiv.org Artificial Intelligence

2505.0663

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

The Efficiency of Pre-training with Objective Masking in Pseudo Labeling for Semi-Supervised Text Classification

Hatefi, Arezoo, Vu, Xuan-Son, Bhuyan, Monowar, Drewes, Frank

arXiv.org Artificial IntelligenceMay-13-2025

We extend and study a semi-supervised model for text classification proposed earlier by Hatefi et al. for classification tasks in which document classes are described by a small number of gold-labeled examples, while the majority of training examples is unlabeled. The model leverages the teacher-student architecture of Meta Pseudo Labels in which a ''teacher'' generates labels for originally unlabeled training data to train the ''student'' and updates its own model iteratively based on the performance of the student on the gold-labeled portion of the data. We extend the original model of Hatefi et al. by an unsupervised pre-training phase based on objective masking, and conduct in-depth performance evaluations of the original model, our extension, and various independent baselines. Experiments are performed using three different datasets in two different languages (English and Swedish).

machine learning, natural language, text classification, (22 more...)

arXiv.org Artificial Intelligence

2505.06624

Country: North America (0.67)

Genre: Research Report > New Finding (0.93)

Industry:

Education (0.93)
Health & Medicine (0.93)
Media > News (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Towards Robust Few-Shot Text Classification Using Transformer Architectures and Dual Loss Strategies

Han, Xu, Sun, Yumeng, Huang, Weiqiang, Zheng, Hongye, Du, Junliang

arXiv.org Artificial IntelligenceMay-12-2025

Few-shot text classification has important application value in low-resource environments. This paper proposes a strategy that combines adaptive fine-tuning, contrastive learning, and regularization optimization to improve the classification performance of Transformer-based models. Experiments on the FewRel 2.0 dataset show that T5-small, DeBERTa-v3, and RoBERTa-base perform well in few-shot tasks, especially in the 5-shot setting, which can more effectively capture text features and improve classification accuracy. The experiment also found that there are significant differences in the classification difficulty of different relationship categories. Some categories have fuzzy semantic boundaries or complex feature distributions, making it difficult for the standard cross entropy loss to learn the discriminative information required to distinguish categories. By introducing contrastive loss and regularization loss, the generalization ability of the model is enhanced, effectively alleviating the overfitting problem in few-shot environments. In addition, the research results show that the use of Transformer models or generative architectures with stronger self-attention mechanisms can help improve the stability and accuracy of few-shot classification.

large language model, machine learning, natural language, (5 more...)

arXiv.org Artificial Intelligence

2505.06145

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.60)

Add feedback

Arabic Metaphor Sentiment Classification Using Semantic Information

Alsiyat, Israa

arXiv.org Artificial IntelligenceApr-29-2025

In this paper, I discuss the testing of the Arabic Metaphor Corpus (AMC) [1] using newly designed automatic tools for sentiment classification for AMC based on semantic tags. The tool incorporates semantic emotional tags for sentiment classification. I evaluate the tool using standard methods, which are F-score, recall, and precision. The method is to show the impact of Arabic online metaphors on sentiment through the newly designed tools. To the best of our knowledge, this is the first approach to conduct sentiment classification for Arabic metaphors using semantic tags to find the impact of the metaphor. Keywords: Arabic metaphor, sentiment analysis, NLP, Arabic semantic tagger 1 Introduction To the best of our knowledge, there are no existing tools specifically developed for Arabic metaphor identification in the context of sentiment analysis. Identifying Arabic metaphors requires pre-annotated data, and in the absence of pre-annotation, a substantial corpus of Arabic metaphors would be necessary to train advanced machine learning algorithms for automatic identification. So, I am using the Arabic Metaphor Corpus (AMC) [1]. In terms of the Arabic metaphor, the very recent study conducted for Arabic metaphor identification with pre-annotated data without integrating sentiment classification is [2].

classification, natural language, text classification, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.5121/ijci.2025.140203

2504.1959

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)

Add feedback

Specialized text classification: an approach to classifying Open Banking transactions

TA, Duc Tuyen, Saad, Wajdi Ben, Oh, Ji Young

arXiv.org Artificial IntelligenceApr-18-2025

Specialized text classification: an approach to classifying Open Banking transactions Duc Tuyen Ta, Wajdi Ben Saad, Ji Y oung Oh Data Science Team - Oney Bank - France Abstract --With the introduction of the PSD2 regulation in the EU which established the Open Banking framework, a new window of opportunities has opened for banks and fintechs to explore and enrich Bank transaction descriptions with the aim of building a better understanding of customer behavior, while using this understanding to prevent fraud, reduce risks and offer more competitive and tailored services. And although the usage of natural language processing models and techniques has seen an incredible progress in various applications and domains over the past few years, custom applications based on domain-specific text corpus remain unaddressed especially in the banking sector . In this paper, we introduce a language-based Open Banking transaction classification system with a focus on the french market and french language text. The system encompasses data collection, labeling, preprocessing, modeling, and evaluation stages. Unlike previous studies that focus on general classification approaches, this system is specifically tailored to address the challenges posed by training a language model with a specialized text corpus (Banking data in the French context).

machine learning, natural language, text classification, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/CSIT61576.2023.10324203

2504.12319

Country: Europe > France (0.25)

Genre: Research Report > New Finding (0.48)

Industry:

Banking & Finance (1.00)
Information Technology > Security & Privacy (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.71)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)

Add feedback