AITopics | Text Classification

Collaborating Authors

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

News Overviews Instructional Materials AI-Alerts Classics

Description-Enhanced Label Embedding Contrastive Learning for Text Classification

Zhang, Kun, Wu, Le, Lv, Guangyi, Chen, Enhong, Ruan, Shulan, Liu, Jing, Zhang, Zhiqiang, Zhou, Jun, Wang, Meng

arXiv.org Artificial IntelligenceJun-14-2023

Text Classification is one of the fundamental tasks in natural language processing, which requires an agent to determine the most appropriate category for input sentences. Recently, deep neural networks have achieved impressive performance in this area, especially Pre-trained Language Models (PLMs). Usually, these methods concentrate on input sentences and corresponding semantic embedding generation. However, for another essential component: labels, most existing works either treat them as meaningless one-hot vectors or use vanilla embedding methods to learn label representations along with model training, underestimating the semantic information and guidance that these labels reveal. To alleviate this problem and better exploit label information, in this paper, we employ Self-Supervised Learning (SSL) in model learning process and design a novel self-supervised Relation of Relation (R2) classification task for label utilization from a one-hot manner perspective. Then, we propose a novel Relation of Relation Learning Network (R2-Net) for text classification, in which text classification and R2 classification are treated as optimization targets. Meanwhile, triplet loss is employed to enhance the analysis of differences and connections among labels. Moreover, considering that one-hot usage is still short of exploiting label information, we incorporate external knowledge from WordNet to obtain multi-aspect descriptions for label semantic learning and extend R2-Net to a novel Description-Enhanced Label Embedding network (DELE) from a label embedding perspective. ...

classification, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2306.08817

Country:

North America > United States (0.14)
Asia > China > Anhui Province > Hefei (0.05)
Asia > China > Beijing > Beijing (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Rank-Aware Negative Training for Semi-Supervised Text Classification

Murtadha, Ahmed, Pan, Shengfeng, Bo, Wen, Su, Jianlin, Cao, Xinxin, Zhang, Wenze, Liu, Yunfeng

arXiv.org Artificial IntelligenceJun-13-2023

Semi-supervised text classification-based paradigms (SSTC) typically employ the spirit of self-training. The key idea is to train a deep classifier on limited labeled texts and then iteratively predict the unlabeled texts as their pseudo-labels for further training. However, the performance is largely affected by the accuracy of pseudo-labels, which may not be significant in real-world scenarios. This paper presents a Rank-aware Negative Training (RNT) framework to address SSTC in learning with noisy label manner. To alleviate the noisy information, we adapt a reasoning with uncertainty-based approach to rank the unlabeled texts based on the evidential support received from the labeled texts. Moreover, we propose the use of negative training to train RNT based on the concept that ``the input instance does not belong to the complementary label''. A complementary label is randomly selected from all labels except the label on-target. Intuitively, the probability of a true label serving as a complementary label is low and thus provides less noisy information during the training, resulting in better performance on the test data. Finally, we evaluate the proposed solution on various text classification benchmark datasets. Our extensive experiments show that it consistently overcomes the state-of-the-art alternatives in most scenarios and achieves competitive performance in the others. The code of RNT is publicly available at:https://github.com/amurtadha/RNT.

machine learning, natural language, text classification, (19 more...)

arXiv.org Artificial Intelligence

2306.07621

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)
Asia > China > Hong Kong (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Weakly-Supervised Scientific Document Classification via Retrieval-Augmented Multi-Stage Training

Xu, Ran, Yu, Yue, Ho, Joyce C., Yang, Carl

arXiv.org Artificial IntelligenceJun-12-2023

Scientific document classification is a critical task for a wide range of applications, but the cost of obtaining massive amounts of human-labeled data can be prohibitive. To address this challenge, we propose a weakly-supervised approach for scientific document classification using label names only. In scientific domains, label names often include domain-specific concepts that may not appear in the document corpus, making it difficult to match labels and documents precisely. To tackle this issue, we propose WANDER, which leverages dense retrieval to perform matching in the embedding space to capture the semantics of label names. We further design the label name expansion module to enrich the label name representations. Lastly, a self-training step is used to refine the predictions. The experiments on three datasets show that WANDER outperforms the best baseline by 11.9% on average. Our code will be published at https://github.com/ritaranx/wander.

information retrieval, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3539618.3592085

2306.07193

Country:

Asia > Taiwan > Taiwan Province > Taipei (0.05)
North America > United States > Georgia > Fulton County > Atlanta (0.05)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.40)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.93)
Health & Medicine > Therapeutic Area > Immunology (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.68)

Add feedback

Imbalanced Multi-label Classification for Business-related Text with Moderately Large Label Spaces

Arslan, Muhammad, Cruz, Christophe

arXiv.org Artificial IntelligenceJun-12-2023

In this study, we compared the performance of four different methods for multi-label text classification using a specific imbalanced business dataset. The four methods we evaluated were fine-tuned BERT, Binary Relevance, Classifier Chains, and Label Powerset. The results show that fine-tuned BERT outperforms the other three methods by a significant margin, achieving high values of accuracy, F1-Score, Precision, and Recall. Binary Relevance also performs well on this dataset, while Classifier Chains and Label Powerset demonstrate relatively poor performance. These findings highlight the effectiveness of fine-tuned BERT for multi-label text classification tasks, and suggest that it may be a useful tool for businesses seeking to analyze complex and multifaceted texts.

machine learning, natural language, text classification, (18 more...)

arXiv.org Artificial Intelligence

2306.07046

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > France > Bourgogne-Franche-Comté > Côte-d'Or > Dijon (0.04)

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

CELDA: Leveraging Black-box Language Model as Enhanced Classifier without Labels

Cho, Hyunsoo, Kim, Youna, Lee, Sang-goo

arXiv.org Artificial IntelligenceJun-9-2023

Utilizing language models (LMs) without internal access is becoming an attractive paradigm in the field of NLP as many cutting-edge LMs are released through APIs and boast a massive scale. The de-facto method in this type of black-box scenario is known as prompting, which has shown progressive performance enhancements in situations where data labels are scarce or unavailable. Despite their efficacy, they still fall short in comparison to fully supervised counterparts and are generally brittle to slight modifications. In this paper, we propose Clustering-enhanced Linear Discriminative Analysis, a novel approach that improves the text classification accuracy with a very weak-supervision signal (i.e., name of the labels). Our framework draws a precise decision boundary without accessing weights or gradients of the LM model or data labels. The core ideas of CELDA are twofold: (1) extracting a refined pseudo-labeled dataset from an unlabeled dataset, and (2) training a lightweight and robust model on the top of LM, which learns an accurate decision boundary from an extracted noisy dataset. Throughout in-depth investigations on various datasets, we demonstrated that CELDA reaches new state-of-the-art in weakly-supervised text classification and narrows the gap with a fully-supervised model. Additionally, our proposed methodology can be applied universally to any LM and has the potential to scale to larger models, making it a more viable option for utilizing large LMs.

machine learning, natural language, text classification, (16 more...)

arXiv.org Artificial Intelligence

2306.02693

Country:

Asia > South Korea > Seoul > Seoul (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry:

Leisure & Entertainment (0.68)
Transportation > Air (0.62)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.55)

Add feedback

HiTIN: Hierarchy-aware Tree Isomorphism Network for Hierarchical Text Classification

Zhu, He, Zhang, Chong, Huang, Junjie, Wu, Junran, Xu, Ke

arXiv.org Artificial IntelligenceJun-9-2023

Hierarchical text classification (HTC) is a challenging subtask of multi-label classification as the labels form a complex hierarchical structure. Existing dual-encoder methods in HTC achieve weak performance gains with huge memory overheads and their structure encoders heavily rely on domain knowledge. Under such observation, we tend to investigate the feasibility of a memory-friendly model with strong generalization capability that could boost the performance of HTC without prior statistics or label semantics. In this paper, we propose Hierarchy-aware Tree Isomorphism Network (HiTIN) to enhance the text representations with only syntactic information of the label hierarchy. Specifically, we convert the label hierarchy into an unweighted tree structure, termed coding tree, with the guidance of structural entropy. Then we design a structure encoder to incorporate hierarchy-aware information in the coding tree into text representations. Besides the text encoder, HiTIN only contains a few multi-layer perceptions and linear transformations, which greatly saves memory. We conduct experiments on three commonly used datasets and the results demonstrate that HiTIN could achieve better test performance and less memory consumption than state-of-the-art (SOTA) methods.

classification, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2305.15182

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Beijing > Beijing (0.04)
North America > United States > New York (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Bias Against 93 Stigmatized Groups in Masked Language Models and Downstream Sentiment Classification Tasks

Mei, Katelyn X., Fereidooni, Sonia, Caliskan, Aylin

arXiv.org Artificial IntelligenceJun-8-2023

The rapid deployment of artificial intelligence (AI) models demands a thorough investigation of biases and risks inherent in these models to understand their impact on individuals and society. This study extends the focus of bias evaluation in extant work by examining bias against social stigmas on a large scale. It focuses on 93 stigmatized groups in the United States, including a wide range of conditions related to disease, disability, drug use, mental illness, religion, sexuality, socioeconomic status, and other relevant factors. We investigate bias against these groups in English pre-trained Masked Language Models (MLMs) and their downstream sentiment classification tasks. To evaluate the presence of bias against 93 stigmatized conditions, we identify 29 non-stigmatized conditions to conduct a comparative analysis. Building upon a psychology scale of social rejection, the Social Distance Scale, we prompt six MLMs: RoBERTa-base, RoBERTa-large, XLNet-large, BERTweet-base, BERTweet-large, and DistilBERT. We use human annotations to analyze the predicted words from these models, with which we measure the extent of bias against stigmatized groups. When prompts include stigmatized conditions, the probability of MLMs predicting negative words is approximately 20 percent higher than when prompts have non-stigmatized conditions. In the sentiment classification tasks, when sentences include stigmatized conditions related to diseases, disability, education, and mental illness, they are more likely to be classified as negative. We also observe a strong correlation between bias in MLMs and their downstream sentiment classifiers (r =0.79). The evidence indicates that MLMs and their downstream sentiment classification tasks exhibit biases against socially stigmatized groups.

natural language, stigmatized condition, text classification, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3593013.3594109 10.1145/3593013.3594109 10.1145/3593013.3594109

2306.0555

Country:

North America > United States > Washington > King County > Seattle (0.46)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Illinois > Cook County > Chicago (0.06)
(8 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Addiction Disorder (0.69)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)

Add feedback

T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification

Unanue, Inigo Jauregi, Haffari, Gholamreza, Piccardi, Massimo

arXiv.org Artificial IntelligenceJun-8-2023

Cross-lingual text classification leverages text classifiers trained in a high-resource language to perform text classification in other languages with no or minimal fine-tuning (zero/few-shots cross-lingual transfer). Nowadays, cross-lingual text classifiers are typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest. However, the performance of these models vary significantly across languages and classification tasks, suggesting that the superposition of the language modelling and classification tasks is not always effective. For this reason, in this paper we propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages. The proposed approach couples 1) a neural machine translator translating from the targeted language to a high-resource language, with 2) a text classifier trained in the high-resource language, but the neural machine translator generates "soft" translations to permit end-to-end backpropagation during fine-tuning of the pipeline. Extensive experiments have been carried out over three cross-lingual text classification datasets (XNLI, MLDoc and MultiEURLEX), with the results showing that the proposed approach has significantly improved performance over a competitive baseline.

machine learning, natural language, text classification, (18 more...)

arXiv.org Artificial Intelligence

2306.04996

Country:

North America > United States > California (0.14)
North America > United States > District of Columbia > Washington (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(3 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Are We Really Making Much Progress in Text Classification? A Comparative Review

Galke, Lukas, Diera, Andor, Lin, Bao Xin, Khera, Bhakti, Meuser, Tim, Singhal, Tushar, Karl, Fabian, Scherp, Ansgar

arXiv.org Artificial IntelligenceJun-4-2023

This study reviews and compares methods for single-label and multi-label text classification, categorized into bag-of-words, sequence-based, graph-based, and hierarchical methods. The comparison aggregates results from the literature over five single-label and seven multi-label datasets and complements them with new experiments. The findings reveal that all recently proposed graph-based and hierarchy-based methods fail to outperform pre-trained language models and sometimes perform worse than standard machine learning methods like a multilayer perceptron on a bag-of-words. To assess the true scientific progress in text classification, future work should thoroughly test against strong bag-of-words baselines and state-of-the-art pre-trained language models.

machine learning, natural language, text classification, (17 more...)

arXiv.org Artificial Intelligence

2204.03954

Country:

North America > United States > Texas > Travis County > Austin (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.14)
(44 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Add feedback

TART: Improved Few-shot Text Classification Using Task-Adaptive Reference Transformation

Lei, Shuo, Zhang, Xuchao, He, Jianfeng, Chen, Fanglan, Lu, Chang-Tien

arXiv.org Artificial IntelligenceJun-3-2023

Meta-learning has emerged as a trending technique to tackle few-shot text classification and achieve state-of-the-art performance. However, the performance of existing approaches heavily depends on the inter-class variance of the support set. As a result, it can perform well on tasks when the semantics of sampled classes are distinct while failing to differentiate classes with similar semantics. In this paper, we propose a novel Task-Adaptive Reference Transformation (TART) network, aiming to enhance the generalization by transforming the class prototypes to per-class fixed reference points in task-adaptive metric spaces. To further maximize divergence between transformed prototypes in task-adaptive metric spaces, TART introduces a discriminative reference regularization among transformed prototypes. Extensive experiments are conducted on four benchmark datasets and our method demonstrates clear superiority over the state-of-the-art models in all the datasets. In particular, our model surpasses the state-of-the-art method by 7.4% and 5.4% in 1-shot and 5-shot classification on the 20 Newsgroups dataset, respectively.

classification, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2306.02175

Country:

Asia > Myanmar (0.04)
Asia > Middle East > Israel (0.04)
North America > United States > Washington > King County > Redmond (0.04)
(3 more...)

Genre: Research Report > Promising Solution (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback