AITopics | Text Classification

Collaborating Authors

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

News Overviews Instructional Materials AI-Alerts Classics

Exploring Machine Learning and Transformer-based Approaches for Deceptive Text Classification: A Comparative Analysis

Krishnan, Anusuya

arXiv.org Artificial IntelligenceAug-10-2023

Deceptive text classification is a critical task in natural language processing that aims to identify deceptive o fraudulent content. This study presents a comparative analysis of machine learning and transformer-based approaches for deceptive text classification. We investigate the effectiveness of traditional machine learning algorithms and state-of-the-art transformer models, such as BERT, XLNET, DistilBERT, and RoBERTa, in detecting deceptive text. A labeled dataset consisting of deceptive and non-deceptive texts is used for training and evaluation purposes. Through extensive experimentation, we compare the performance metrics, including accuracy, precision, recall, and F1 score, of the different approaches. The results of this study shed light on the strengths and limitations of machine learning and transformer-based methods for deceptive text classification, enabling researchers and practitioners to make informed decisions when dealing with deceptive content.

accuracy, large language model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2308.05476

Country:

Asia > Middle East > UAE (0.14)
North America > United States > Illinois > Cook County > Chicago (0.04)
South America > Brazil > Ceará > Fortaleza (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Unsupervised Calibration through Prior Adaptation for Text Classification using Large Language Models

Estienne, Lautaro, Ferrer, Luciana, Vera, Matías, Piantanida, Pablo

arXiv.org Artificial IntelligenceAug-9-2023

A wide variety of natural language tasks are currently being addressed with large-scale language models (LLMs). These models are usually trained with a very large amount of unsupervised text data and adapted to perform a downstream natural language task using methods like fine-tuning, calibration or in-context learning. In this work, we propose an approach to adapt the prior class distribution to perform text classification tasks without the need for labelled samples and only few in-domain sample queries. The proposed approach treats the LLM as a black box, adding a stage where the model posteriors are calibrated to the task. Results show that these methods outperform the un-adapted model for different number of training shots in the prompt and a previous approach were calibration is performed without using any adaptation data.

language model, text classification, unsupervised calibration, (1 more...)

arXiv.org Artificial Intelligence

doi: 10.26615/issn.2603-2821.2023_002

2307.06713

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.60)

Add feedback

Adversarial Word Dilution as Text Data Augmentation in Low-Resource Regime

Chen, Junfan, Zhang, Richong, Luo, Zheyan, Hu, Chunming, Mao, Yongyi

arXiv.org Artificial IntelligenceAug-9-2023

Data augmentation is widely used in text classification, especially in the low-resource regime where a few examples for each class are available during training. Despite the success, generating data augmentations as hard positive examples that may increase their effectiveness is under-explored. This paper proposes an Adversarial Word Dilution (AWD) method that can generate hard positive examples as text data augmentations to train the low-resource text classification model efficiently. Our idea of augmenting the text data is to dilute the embedding of strong positive words by weighted mixing with unknown-word embedding, making the augmented inputs hard to be recognized as positive by the classification model. We adversarially learn the dilution weights through a constrained min-max optimization process with the guidance of the labels. Empirical studies on three benchmark datasets show that AWD can generate more effective data augmentations and outperform the state-of-the-art text data augmentation methods. The additional analysis demonstrates that the data augmentations generated by AWD are interpretable and can flexibly extend to new examples without further training.

machine learning, natural language, text classification, (14 more...)

arXiv.org Artificial Intelligence

2305.09287

Country:

North America > United States > Connecticut (0.04)
Asia > China > Beijing > Beijing (0.04)
North America > Canada > Ontario > National Capital Region > Ottawa (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

A Comparative Study on TF-IDF feature Weighting Method and its Analysis using Unstructured Dataset

Das, Mamata, K., Selvakumar, Alphonse, P. J. A.

arXiv.org Artificial IntelligenceAug-8-2023

Text Classification is the process of categorizing text into the relevant categories and its algorithms are at the core of many Natural Language Processing (NLP). Term Frequency-Inverse Document Frequency (TF-IDF) and NLP are the most highly used information retrieval methods in text classification. We have investigated and analyzed the feature weighting method for text classification on unstructured data. The proposed model considered two features N-Grams and TF-IDF on the IMDB movie reviews and Amazon Alexa reviews dataset for sentiment analysis. Then we have used the state-of-the-art classifier to validate the method i.e., Support Vector Machine (SVM), Logistic Regression, Multinomial Naive Bayes (Multinomial NB), Random Forest, Decision Tree, and k-nearest neighbors (KNN). From those two feature extractions, a significant increase in feature extraction with TF-IDF features rather than based on N-Gram. TF-IDF got the maximum accuracy (93.81%), precision (94.20%), recall (93.81%), and F1-score (91.99%) value in Random Forest classifier.

machine learning, natural language, text classification, (16 more...)

arXiv.org Artificial Intelligence

2308.04037

Country:

North America > United States > Oregon > Multnomah County > Portland (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(3 more...)

Genre: Research Report > New Finding (0.35)

Industry:

Information Technology > Services (1.00)
Media > Film (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)

Add feedback

Gzip versus bag-of-words for text classification

Opitz, Juri

arXiv.org Artificial IntelligenceAug-8-2023

KNN is a simple classifier that uses distance measurements between data points: For a given testing point, we calculate its distance to every other point from some labeled training set and check the labels of the K closest points (i.e., the K-Neirest Neighbors), predicting the label that we observe most frequently. Hence, it is straightforward to build a general text classifier from KNN, if we can equip it with a sensible distance measure between documents. Interestingly, recent findings [4] suggest that we can exploit compression to assess the distance of two documents, by comparing their individual compression lengths to the length of their compressed concatenation (we call this measurement gzip). With this approach, [4] show strong text classification performance across different data sets, sometimes achieving higher accuracy than trained neural classifiers such as BERT [2], especially in scenarios where only few training data are available. Against this background, it is not surprising that gzip has quickly attracted lots of attention.

machine learning, natural language, training data, (20 more...)

arXiv.org Artificial Intelligence

2307.15002

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.72)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Add feedback

Meta-Tsallis-Entropy Minimization: A New Self-Training Approach for Domain Adaptation on Text Classification

Lu, Menglong, Huang, Zhen, Tian, Zhiliang, Zhao, Yunxiang, Fei, Xuanyu, Li, Dongsheng

arXiv.org Artificial IntelligenceAug-4-2023

Text classification is a fundamental task for natural language processing, and adapting text classification models across domains has broad applications. Self-training generates pseudo-examples from the model's predictions and iteratively trains on the pseudo-examples, i.e., minimizes the loss on the source domain and the Gibbs entropy on the target domain. However, Gibbs entropy is sensitive to prediction errors, and thus, self-training tends to fail when the domain shift is large. In this paper, we propose Meta-Tsallis Entropy minimization (MTEM), which applies a meta-learning algorithm to optimize the instance adaptive Tsallis entropy on the target domain. To reduce the computation cost of MTEM, we propose an approximation technique to approximate the Second-order derivation involved in the meta-learning. To efficiently generate pseudo labels, we propose an annealing sampling mechanism for exploring the model's prediction probability. Theoretically, we prove the convergence of the meta-learning algorithm in MTEM and analyze the effectiveness of MTEM in achieving domain adaptation. Experimentally, MTEM improves the adaptation performance of BERT with an average of 4 percent on the benchmark dataset.

entropy index, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2308.02746

Country:

Europe (0.14)
Asia > Middle East > Republic of Türkiye (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.40)

Industry: Consumer Products & Services (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.91)

Add feedback

Industrial Memories: Exploring the Findings of Government Inquiries with Neural Word Embedding and Machine Learning

Leavy, Susan, Pine, Emilie, Keane, Mark T

arXiv.org Artificial IntelligenceAug-2-2023

We present a text mining system to support the exploration of large volumes of text detailing the findings of government inquiries. Despite their historical significance and potential societal impact, key findings of inquiries are often hidden within lengthy documents and remain inaccessible to the general public. We transform the findings of the Irish government's inquiry into industrial schools and through the use of word embedding, text classification and visualization, present an interactive web-based platform that enables the exploration of the text to uncover new historical insights.

data mining, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-3-030-10997-4_52

2308.02556

Country: Europe > Ireland > Leinster > County Dublin > Dublin (0.05)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.98)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.51)
Information Technology > Data Science > Data Mining > Text Mining (0.36)

Add feedback

Recent Advances in Hierarchical Multi-label Text Classification: A Survey

Liu, Rundong, Liang, Wenhan, Luo, Weijun, Song, Yuxiang, Zhang, He, Xu, Ruohua, Li, Yunfeng, Liu, Ming

arXiv.org Artificial IntelligenceJul-30-2023

Hierarchical multi-label text classification aims to classify the input text into multiple labels, among which the labels are structured and hierarchical. It is a vital task in many real world applications, e.g. scientific literature archiving. In this paper, we survey the recent progress of hierarchical multi-label text classification, including the open sourced data sets, the main methods, evaluation metrics, learning strategies and the current challenges. A few future research directions are also listed for community to further improve this field.

classification, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2307.16265

Country:

North America > Dominican Republic (0.05)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.05)
Asia > China > Beijing > Beijing (0.05)
(7 more...)

Genre:

Overview (1.00)
Research Report (0.84)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)

Add feedback

WC-SBERT: Zero-Shot Text Classification via SBERT with Self-Training for Wikipedia Categories

Chi, Te-Yu, Tang, Yu-Meng, Lu, Chia-Wen, Zhang, Qiu-Xia, Jang, Jyh-Shing Roger

arXiv.org Artificial IntelligenceJul-28-2023

Our research focuses on solving the zero-shot text classification problem in NLP, with a particular emphasis on innovative self-training strategies. To achieve this objective, we propose a novel self-training strategy that uses labels rather than text for training, significantly reducing the model's training time. Specifically, we use categories from Wikipedia as our training set and leverage the SBERT pre-trained model to establish positive correlations between pairs of categories within the same text, facilitating associative training. For new test datasets, we have improved the original self-training approach, eliminating the need for prior training and testing data from each target dataset. Instead, we adopt Wikipedia as a unified training dataset to better approximate the zero-shot scenario. This modification allows for rapid fine-tuning and inference across different datasets, greatly reducing the time required for self-training. Our experimental results demonstrate that this method can adapt the model to the target dataset within minutes. Compared to other BERT-based transformer models, our approach significantly reduces the amount of training data by training only on labels, not the actual text, and greatly improves training efficiency by utilizing a unified training set. Additionally, our method achieves state-of-the-art results on both the Yahoo Topic and AG News datasets.

accuracy, category, dataset, (14 more...)

arXiv.org Artificial Intelligence

2307.15293

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
(2 more...)

Genre: Research Report > New Finding (0.88)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.87)

Add feedback

Adaptive Meta-learner via Gradient Similarity for Few-shot Text Classification

Lei, Tianyi, Hu, Honghui, Luo, Qiaoyang, Peng, Dezhong, Wang, Xu

arXiv.org Artificial IntelligenceJul-28-2023

Few-shot text classification aims to classify the text under the few-shot scenario. Most of the previous methods adopt optimization-based meta learning to obtain task distribution. However, due to the neglect of matching between the few amount of samples and complicated models, as well as the distinction between useful and useless task features, these methods suffer from the overfitting issue. To address this issue, we propose a novel Adaptive Meta-learner via Gradient Similarity (AMGS) method to improve the model generalization ability to a new task. Specifically, the proposed AMGS alleviates the overfitting based on two aspects: (i) acquiring the potential semantic representation of samples and improving model generalization through the self-supervised auxiliary task in the inner loop, (ii) leveraging the adaptive meta-learner via gradient similarity to add constraints on the gradient obtained by base-learner in the outer loop. Moreover, we make a systematic analysis of the influence of regularization on the entire framework. Experimental results on several benchmarks demonstrate that the proposed AMGS consistently improves few-shot text classification performance compared with the state-of-the-art optimization-based meta-learning approaches.

machine learning, natural language, text classification, (16 more...)

arXiv.org Artificial Intelligence

2209.04702

Country:

Asia > China (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(2 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.83)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.66)

Add feedback