AITopics | Text Classification

Collaborating Authors

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

News Overviews Instructional Materials AI-Alerts Classics

A Comparative Study of Machine Unlearning Techniques for Image and Text Classification Models

Safa, Omar M., Abdelaziz, Mahmoud M., Eltawy, Mustafa, Mamdouh, Mohamed, Gharib, Moamen, Eltenihy, Salaheldin, Ghanem, Nagia M., Ismail, Mohamed M.

arXiv.org Artificial IntelligenceDec-27-2024

Machine Unlearning has emerged as a critical area in artificial intelligence, addressing the need to selectively remove learned data from machine learning models in response to data privacy regulations. This paper provides a comprehensive comparative analysis of six state-of-theart unlearning techniques applied to image and text classification tasks. We evaluate their performance, efficiency, and compliance with regulatory requirements, highlighting their strengths and limitations in practical scenarios. By systematically analyzing these methods, we aim to provide insights into their applicability, challenges,and tradeoffs, fostering advancements in the field of ethical and adaptable machine learning.

image and text classification model, machine learning, natural language, (3 more...)

arXiv.org Artificial Intelligence

2412.19583

Genre: Research Report (0.69)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.60)

Add feedback

Clustering Algorithms and RAG Enhancing Semi-Supervised Text Classification with Large LLMs

Zhong, Shan, Zeng, Jiahao, Yu, Yongxin, Lin, Bohong

arXiv.org Artificial IntelligenceDec-25-2024

This paper proposes a Clustering, Labeling, then Augmenting framework that significantly enhances performance in Semi-Supervised Text Classification (SSTC) tasks, effectively addressing the challenge of vast datasets with limited labeled examples. Unlike traditional SSTC approaches that rely on a predefined small set of labeled data to generate pseudo-labels for the unlabeled data, this framework innovatively employs clustering to select representative "landmarks" for labeling. These landmarks subsequently act as intermediaries in an ensemble of augmentation techniques, including Retrieval-Augmented Generation (RAG), Large Language Model (LLMs)-based rewriting, and synonym substitution, to generate synthetic labeled data without making pseudo-labels for the unlabeled data. Empirical results show that even in complex text document classification scenarios involving over 100 categories, our method achieves state-of-the-art accuracies of 95.41% on the Reuters dataset and 82.43% on the Web of Science dataset. Our approach significantly reduces the reliance on human labeling efforts and the associated expenses, while simultaneously ensuring high data quality and minimizing privacy risks. The finetuning results further show the efficiency of fine-tuning LLMs for text classification tasks, highlighting a robust solution for leveraging limited labeled data.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2411.06175

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Rheumatology (1.00)
Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
(6 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification

Rusli, Andre, Shishido, Makoto

arXiv.org Artificial IntelligenceDec-23-2024

This study investigates the performance of three popular tokenization tools: MeCab, Sudachi, and SentencePiece, when applied as a preprocessing step for sentiment-based text classification of Japanese texts. Using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, we evaluate two traditional machine learning classifiers: Multinomial Naive Bayes and Logistic Regression. The results reveal that Sudachi produces tokens closely aligned with dictionary definitions, while MeCab and SentencePiece demonstrate faster processing speeds. The combination of SentencePiece, TF-IDF, and Logistic Regression outperforms the other alternatives in terms of classification performance.

machine learning, natural language, text classification, (19 more...)

arXiv.org Artificial Intelligence

2412.17361

Country: Asia > Japan > Honshū (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.51)

Add feedback

GAProtoNet: A Multi-head Graph Attention-based Prototypical Network for Interpretable Text Classification

Wen, Ximing, Tan, Wenjuan, Weber, Rosina O.

arXiv.org Artificial IntelligenceDec-19-2024

Pretrained transformer-based Language Models (LMs) are well-known for their ability to achieve significant improvement on text classification tasks with their powerful word embeddings, but their black-box nature, which leads to a lack of interpretability, has been a major concern. In this work, we introduce GAProtoNet, a novel white-box Multi-head Graph Attention-based Prototypical Network designed to explain the decisions of text classification models built with LM encoders. In our approach, the input vector and prototypes are regarded as nodes within a graph, and we utilize multi-head graph attention to selectively construct edges between the input node and prototype nodes to learn an interpretable prototypical representation. During inference, the model makes decisions based on a linear combination of activated prototypes weighted by the attention score assigned for each prototype, allowing its choices to be transparently explained by the attention weights and the prototypes projected into the closest matching training examples. Experiments on multiple public datasets show our approach achieves superior results without sacrificing the accuracy of the original black-box LMs. We also compare with four alternative prototypical network variations and our approach achieves the best accuracy and F1 among all. Our case study and visualization of prototype clusters also demonstrate the efficiency in explaining the decisions of black-box models built with LMs.

machine learning, natural language, text classification, (20 more...)

arXiv.org Artificial Intelligence

2409.13312

Country:

North America > United States > Oregon (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback

Deep Learning and Machine Learning -- Natural Language Processing: From Theory to Application

Chen, Keyu, Fei, Cheng, Bi, Ziqian, Liu, Junyu, Peng, Benji, Zhang, Sen, Pan, Xuanhe, Xu, Jiawei, Wang, Jinlang, Yin, Caitlyn Heqi, Zhang, Yichao, Feng, Pohsun, Wen, Yizhu, Wang, Tianyang, Li, Ming, Ren, Jintao, Niu, Qian, Chen, Silin, Hsieh, Weiche, Yan, Lawrence K. Q., Liang, Chia Xin, Xu, Han, Tseng, Hong-Ming, Song, Xinyuan, Liu, Ming

arXiv.org Artificial IntelligenceDec-17-2024

With a focus on natural language processing (NLP) and the role of large language models (LLMs), we explore the intersection of machine learning, deep learning, and artificial intelligence. As artificial intelligence continues to revolutionize fields from healthcare to finance, NLP techniques such as tokenization, text classification, and entity recognition are essential for processing and understanding human language. This paper discusses advanced data preprocessing techniques and the use of frameworks like Hugging Face for implementing transformer-based models. Additionally, it highlights challenges such as handling multilingual data, reducing bias, and ensuring model robustness. By addressing key aspects of data processing and model fine-tuning, this work aims to provide insights into deploying effective and ethically sound AI solutions.

information retrieval, large language model, machine learning, (25 more...)

arXiv.org Artificial Intelligence

2411.05026

Country:

North America > United States (1.00)
Asia (1.00)

Genre:

Workflow (1.00)
Overview (1.00)
Instructional Material > Course Syllabus & Notes (0.67)
(2 more...)

Industry:

Media (1.00)
Leisure & Entertainment (1.00)
Information Technology > Services (1.00)
(11 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Your Next State-of-the-Art Could Come from Another Domain: A Cross-Domain Analysis of Hierarchical Text Classification

Li, Nan, Kang, Bo, De Bie, Tijl

arXiv.org Artificial IntelligenceDec-17-2024

Text classification with hierarchical labels is a prevalent and challenging task in natural language processing. Examples include assigning ICD codes to patient records, tagging patents into IPC classes, assigning EUROVOC descriptors to European legal texts, and more. Despite its widespread applications, a comprehensive understanding of state-of-the-art methods across different domains has been lacking. In this paper, we provide the first comprehensive cross-domain overview with empirical analysis of state-of-the-art methods. We propose a unified framework that positions each method within a common structure to facilitate research. Our empirical analysis yields key insights and guidelines, confirming the necessity of learning across different research areas to design effective methods. Notably, under our unified evaluation pipeline, we achieved new state-of-the-art results by applying techniques beyond their original domains.

classification, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2412.12744

Country:

North America > United States (0.28)
Europe > Spain > Castile and León > Salamanca Province > Salamanca (0.04)
Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
Europe > Belgium (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Promising Solution (0.68)

Industry:

Government (1.00)
Law (0.89)
Health & Medicine > Health Care Providers & Services (0.48)
Health & Medicine > Health Care Technology > Medical Record (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.73)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)

Add feedback

The Reliability Paradox: Exploring How Shortcut Learning Undermines Language Model Calibration

Bihani, Geetanjali, Rayz, Julia

arXiv.org Artificial IntelligenceDec-17-2024

The advent of pre-trained language models (PLMs) has enabled significant performance gains in the field of natural language processing. However, recent studies have found PLMs to suffer from miscalibration, indicating a lack of accuracy in the confidence estimates provided by these models. Current evaluation methods for PLM calibration often assume that lower calibration error estimates indicate more reliable predictions. However, fine-tuned PLMs often resort to shortcuts, leading to overconfident predictions that create the illusion of enhanced performance but lack generalizability in their decision rules. The relationship between PLM reliability, as measured by calibration error, and shortcut learning, has not been thoroughly explored thus far. This paper aims to investigate this relationship, studying whether lower calibration error implies reliable decision rules for a language model. Our findings reveal that models with seemingly superior calibration portray higher levels of non-generalizable decision rules. This challenges the prevailing notion that well-calibrated models are inherently reliable. Our study highlights the need to bridge the current gap between language model calibration and generalization objectives, urging the development of comprehensive frameworks to achieve truly robust and reliable language models.

machine learning, natural language, text classification, (18 more...)

arXiv.org Artificial Intelligence

2412.15269

Country: South America (0.28)

Genre: Research Report > New Finding (0.66)

Industry: Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Add feedback

WordVIS: A Color Worth A Thousand Words

Khan, Umar, Saifullah, null, Agne, Stefan, Dengel, Andreas, Ahmed, Sheraz

arXiv.org Artificial IntelligenceDec-13-2024

Document classification is considered a critical element in automated document processing systems. In recent years multi-modal approaches have become increasingly popular for document classification. Despite their improvements, these approaches are underutilized in the industry due to their requirement for a tremendous volume of training data and extensive computational power. In this paper, we attempt to address these issues by embedding textual features directly into the visual space, allowing lightweight image-based classifiers to achieve state-of-the-art results using small-scale datasets in document classification. To evaluate the efficacy of the visual features generated from our approach on limited data, we tested on the standard dataset Tobacco-3482. Our experiments show a tremendous improvement in image-based classifiers, achieving an improvement of 4.64% using ResNet50 with no document pre-training. It also sets a new record for the best accuracy of the Tobacco-3482 dataset with a score of 91.14% using the image-based DocXClassifier with no document pre-training. The simplicity of the approach, its resource requirements, and subsequent results provide a good prospect for its use in industrial use cases.

classification, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2412.10155

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Germany (0.04)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.92)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.69)

Add feedback

Low-Resource Fast Text Classification Based on Intra-Class and Inter-Class Distance Calculation

Mao, Yanxu, Liu, Peipei, Cui, Tiehan, Liu, Congying, You, Datao

arXiv.org Artificial IntelligenceDec-13-2024

In recent years, text classification methods based on neural networks and pre-trained models have gained increasing attention and demonstrated excellent performance. However, these methods still have some limitations in practical applications: (1) They typically focus only on the matching similarity between sentences. However, there exists implicit high-value information both within sentences of the same class and across different classes, which is very crucial for classification tasks. (2) Existing methods such as pre-trained language models and graph-based approaches often consume substantial memory for training and text-graph construction. (3) Although some low-resource methods can achieve good performance, they often suffer from excessively long processing times. To address these challenges, we propose a low-resource and fast text classification model called LFTC. Our approach begins by constructing a compressor list for each class to fully mine the regularity information within intra-class data. We then remove redundant information irrelevant to the target classification to reduce processing time. Finally, we compute the similarity distance between text pairs for classification. We evaluate LFTC on 9 publicly available benchmark datasets, and the results demonstrate significant improvements in performance and processing time, especially under limited computational and data resources, highlighting its superior advantages.

classification, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2412.09922

Country: Asia > China (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

Add feedback

Label-template based Few-Shot Text Classification with Contrastive Learning

Hou, Guanghua, Cao, Shuhui, Ouyang, Deqiang, Wang, Ning

arXiv.org Artificial IntelligenceDec-13-2024

As an algorithmic framework for learning to learn, meta-learning provides a promising solution for few-shot text classification. However, most existing research fail to give enough attention to class labels. Traditional basic framework building meta-learner based on prototype networks heavily relies on inter-class variance, and it is easily influenced by noise. To address these limitations, we proposes a simple and effective few-shot text classification framework. In particular, the corresponding label templates are embed into input sentences to fully utilize the potential value of class labels, guiding the pre-trained model to generate more discriminative text representations through the semantic information conveyed by labels. With the continuous influence of label semantics, supervised contrastive learning is utilized to model the interaction information between support samples and query samples. Furthermore, the averaging mechanism is replaced with an attention mechanism to highlight vital semantic information. To verify the proposed scheme, four typical datasets are employed to assess the performance of different methods. Experimental results demonstrate that our method achieves substantial performance enhancements and outperforms existing state-of-the-art models on few-shot text classification tasks.

contrastive learning, representation, text classification, (13 more...)

arXiv.org Artificial Intelligence

2412.1011

Country: Asia > China > Chongqing Province > Chongqing (0.05)

Genre:

Research Report > Promising Solution (0.68)
Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)

Add feedback