AITopics | Text Classification

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.
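
As a rough illustration of the spam-filtering use case mentioned in the quote, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is not taken from the cited article: the function names `train` and `predict`, the whitespace tokenizer, and the toy training messages are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, label) pairs. Returns the counts needed for prediction."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest log posterior for the given text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # log prior plus log likelihoods with add-one (Laplace) smoothing
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up training data (hypothetical labels "spam" and "ham")
model = train([
    ("win cash prize now", "spam"),
    ("cheap prize click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status and agenda", "ham"),
])

print(predict(model, "claim your cash prize"))   # spam
print(predict(model, "monday project meeting"))  # ham
```

A production filter would need a better tokenizer and far more data; the sketch only shows how word counts per class turn into a metadata decision about a document.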

Benchmarking Multilabel Topic Classification in the Kyrgyz Language

Alekseev, Anton, Nikolenko, Sergey I., Kabaeva, Gulnara

arXiv.org Artificial Intelligence, Aug-30-2023

Kyrgyz is a very underrepresented language in terms of modern natural language processing resources. In this work, we present a new public benchmark for topic classification in Kyrgyz, introducing a dataset based on collected and annotated data from the news site 24.KG and presenting several baseline models for news classification in the multilabel setting. We train and evaluate both classical statistical and neural models, reporting the scores, discussing the results, and proposing directions for future work.

benchmarking multilabel topic classification, kyrgyz language

arXiv: 2308.15952

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.60)

Towards One-Shot Learning for Text Classification using Inductive Logic Programming

Milani, Ghazal Afroozi, Cyrus, Daniel, Tamaddoni-Nezhad, Alireza

arXiv.org Artificial Intelligence, Aug-30-2023

With the ever-increasing potential of AI to perform personalised tasks, it is becoming essential to develop new machine learning techniques which are data-efficient and do not require hundreds or thousands of training examples. In this paper, we explore an Inductive Logic Programming approach for one-shot text classification. In particular, we explore the framework of Meta-Interpretive Learning (MIL), along with using common-sense background knowledge extracted from ConceptNet. Results indicate that MIL can learn text classification rules from a small number of training examples. Moreover, the higher the complexity of the chosen examples, the higher the accuracy of the outcome.

inductive logic programming, one-shot learning, text classification

doi: 10.4204/EPTCS.385.9

arXiv: 2308.15885

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.80)

MatchXML: An Efficient Text-label Matching Framework for Extreme Multi-label Text Classification

Ye, Hui, Sunderraman, Rajshekhar, Ji, Shihao

arXiv.org Artificial Intelligence, Aug-24-2023

eXtreme Multi-label text Classification (XMC) refers to training a classifier that assigns relevant labels to a text sample from an extremely large-scale label set (e.g., millions of labels). We propose MatchXML, an efficient text-label matching framework for XMC. We observe that the label embeddings generated from the sparse Term Frequency-Inverse Document Frequency (TF-IDF) features have several limitations. We thus propose label2vec to effectively train the semantic dense label embeddings by the Skip-gram model. The dense label embeddings are then used to build a Hierarchical Label Tree by clustering. In fine-tuning the pre-trained encoder Transformer, we formulate the multi-label text classification as a text-label matching problem in a bipartite graph. We then extract the dense text representations from the fine-tuned Transformer. Besides the fine-tuned dense text embeddings, we also extract the static dense sentence embeddings from a pre-trained Sentence Transformer. Finally, a linear ranker is trained by utilizing the sparse TF-IDF features, the fine-tuned dense text representations and static dense sentence features. Experimental results demonstrate that MatchXML achieves state-of-the-art accuracy on five out of six datasets. As for the speed, MatchXML outperforms the competing methods on all six datasets. Our source code is publicly available at https://github.com/huiyegit/MatchXML.

classification, machine learning, natural language, (20 more...)

arXiv: 2308.13139

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)

Taken by Surprise: Contrast effect for Similarity Scores

Bachlechner, Thomas C., Martone, Mario, Schillo, Marjorie

arXiv.org Artificial Intelligence, Aug-22-2023

Accurately evaluating the similarity of object vector embeddings is of critical importance for natural language processing, information retrieval and classification tasks. Popular similarity scores (e.g., cosine similarity) are based on pairs of embedding vectors and disregard the distribution of the ensemble from which objects are drawn. Human perception of object similarity significantly depends on the context in which the objects appear. In this work we propose the surprise score, an ensemble-normalized similarity metric that encapsulates the contrast effect of human perception and significantly improves the classification performance on zero- and few-shot document classification tasks. This score quantifies the surprise to find a given similarity between two elements relative to the pairwise ensemble similarities. We evaluate this metric on zero/few-shot classification and clustering tasks and typically find 10-15% better performance compared to raw cosine similarity. Our code is available at https://github.com/MeetElise/surprise-similarity.
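
The abstract does not give a formula, so the following is only one plausible reading of an "ensemble-normalized" similarity, not the authors' definition (their repository is the authoritative source). The sketch standardizes the cosine similarity between a query and a key against the mean and standard deviation of the query's similarities to a whole ensemble, then maps the result through the normal CDF. The function name `ensemble_normalized_score`, the normal-distribution assumption, and the toy vectors are all assumptions made here.

```python
import math
from statistics import mean, pstdev

def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ensemble_normalized_score(query, key, ensemble):
    """How unusual is sim(query, key) compared with sim(query, e) over the ensemble?

    Assumption: ensemble similarities are treated as roughly normal, so the
    score is the normal CDF of the z-score and lies in [0, 1].
    """
    sims = [cosine(query, e) for e in ensemble]
    mu, sigma = mean(sims), pstdev(sims)
    if sigma == 0:
        return 0.5  # no spread in the ensemble: nothing is surprising
    z = (cosine(query, key) - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Toy, made-up 2-d embeddings
ensemble = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
query = [1.0, 0.0]

high = ensemble_normalized_score(query, [1.0, 0.0], ensemble)
low = ensemble_normalized_score(query, [0.0, 1.0], ensemble)
print(high > 0.5 > low)  # True
```

Unlike raw cosine similarity, the same pair can score differently when the surrounding ensemble changes, which is the context dependence the abstract describes.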

information retrieval, machine learning, natural language, (21 more...)

arXiv: 2308.09765

Country:

North America > United States > Oregon > Multnomah County > Portland (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.67)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Data Augmentation using Transformers and Similarity Measures for Improving Arabic Text Classification

Refai, Dania, Abo-Soud, Saleh, Abdel-Rahman, Mohammad

arXiv.org Artificial Intelligence, Aug-20-2023

The performance of learning models heavily relies on the availability and adequacy of training data. To address the dataset adequacy issue, researchers have extensively explored data augmentation (DA) as a promising approach. DA generates new data instances through transformations applied to the available data, thereby increasing dataset size and variability. This approach has enhanced model performance and accuracy, particularly in addressing class imbalance problems in classification tasks. However, few studies have explored DA for the Arabic language, relying on traditional approaches such as paraphrasing or noising-based techniques. In this paper, we propose a new Arabic DA method that employs a recent powerful modeling technique, namely AraGPT-2, for the augmentation process. The generated sentences are evaluated in terms of context, semantics, diversity, and novelty using the Euclidean, cosine, Jaccard, and BLEU distances. Finally, the AraBERT transformer is used on sentiment classification tasks to evaluate the classification performance of the augmented Arabic dataset. The experiments were conducted on four Arabic sentiment datasets: AraSarcasm, ASTD, ATT, and MOVIE. The selected datasets vary in size, number of labels, and class imbalance. The results show that the proposed methodology enhanced the Arabic sentiment text classification on all datasets with an increase in F1 score by 4% in AraSarcasm, 6% in ASTD, 9% in ATT, and 13% in MOVIE.
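
To show how a generated sentence might be compared with its source, here is a minimal sketch of three of the measures named in the abstract (Jaccard, cosine, Euclidean), computed on bag-of-words counts. The bag-of-words representation, the whitespace tokenizer, and the English toy sentences are assumptions made for this example; the paper may compute these measures on other representations, and BLEU is omitted.

```python
import math
from collections import Counter

def bag(text):
    """Lowercased bag-of-words counts (whitespace tokenizer, an assumption)."""
    return Counter(text.lower().split())

def jaccard(a, b):
    """Overlap of the two vocabularies: |A & B| / |A | B|."""
    sa, sb = set(bag(a)), set(bag(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def cosine(a, b):
    """Cosine similarity between the two count vectors."""
    ca, cb = bag(a), bag(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(c * c for c in ca.values()))
    nb = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def euclidean(a, b):
    """Euclidean distance between the two count vectors."""
    ca, cb = bag(a), bag(b)
    return math.sqrt(sum((ca[w] - cb[w]) ** 2 for w in set(ca) | set(cb)))

# Made-up example pair: an original sentence and a hypothetical augmentation
original = "the film was great"
generated = "the film was really great"

print(jaccard(original, generated))    # 0.8
print(euclidean(original, generated))  # 1.0
print(round(cosine(original, generated), 4))
```

High similarity with a nonzero distance suggests the augmentation stays on topic while still adding some novelty, which is the trade-off such measures are used to check.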

machine learning, natural language, text classification, (20 more...)

arXiv: 2212.13939

Country:

Asia > Middle East > Jordan > Amman Governorate > Amman (0.04)
Europe > Netherlands > South Holland > Rotterdam (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

Optimizing Multi-Class Text Classification: A Diverse Stacking Ensemble Framework Utilizing Transformers

Krishnan, Anusuya

arXiv.org Artificial Intelligence, Aug-19-2023

Customer reviews play a crucial role in assessing customer satisfaction, gathering feedback, and driving improvements for businesses. Analyzing these reviews provides valuable insights into customer sentiments, including compliments, comments, and suggestions. Text classification techniques enable businesses to categorize customer reviews into distinct categories, facilitating a better understanding of customer feedback. However, challenges such as overfitting and bias limit the effectiveness of a single classifier in ensuring optimal prediction. This study proposes a novel approach to address these challenges by introducing a stacking ensemble-based multi-text classification method that leverages transformer models. By combining multiple single transformers, including BERT, ELECTRA, and DistilBERT, as base-level classifiers, and a meta-level classifier based on RoBERTa, an optimal predictive model is generated. The proposed stacking ensemble-based multi-text classification method aims to enhance the accuracy and robustness of customer review analysis. Experimental evaluations conducted on a real-world customer review dataset demonstrate the effectiveness and superiority of the proposed approach over traditional single classifier models. The stacking ensemble-based multi-text classification method using transformers proves to be a promising solution for businesses seeking to extract valuable insights from customer reviews and make data-driven decisions to enhance customer satisfaction and drive continuous improvement.

accuracy, classification, classifier, (11 more...)

arXiv: 2308.11519

Country:

Asia > Middle East > UAE (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Promising Solution (0.68)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)

Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification

Razuvayevskaya, Olesya, Wu, Ben, Leite, Joao A., Heppell, Freddy, Srba, Ivan, Scarton, Carolina, Bontcheva, Kalina, Song, Xingyi

arXiv.org Artificial Intelligence, Aug-14-2023

Adapters and Low-Rank Adaptation (LoRA) are parameter-efficient fine-tuning techniques designed to make the training of language models more efficient. Previous results demonstrated that these methods can even improve performance on some classification tasks. This paper complements the existing research by investigating how these techniques influence the classification performance and computation costs compared to full fine-tuning when applied to multilingual text classification tasks (genre, framing, and persuasion techniques detection; with different input lengths, number of predicted classes and classification difficulty), some of which have limited training data. In addition, we conduct in-depth analyses of their efficacy across different training scenarios (training on the original multilingual data; on the translations into English; and on a subset of English-only data) and different languages. Our findings provide valuable insights into the applicability of the parameter-efficient fine-tuning techniques, particularly to complex multilingual and multilabel classification tasks.
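
To make "parameter-efficient" concrete, the following is a minimal sketch of the general LoRA idea for a single dense layer: the pre-trained weight matrix W stays frozen and only a low-rank update B·A is trained. The function names, the `scale` argument, and the 768-dimensional layer size are illustrative assumptions and do not come from the paper.

```python
def full_params(d_out, d_in):
    """Trainable weights when the whole d_out x d_in matrix is fine-tuned."""
    return d_out * d_in

def lora_trainable_params(d_out, d_in, rank):
    """Trainable weights for B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)

def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

# Parameter counts for a hypothetical 768 x 768 layer with rank 8
print(full_params(768, 768))               # 589824
print(lora_trainable_params(768, 768, 8))  # 12288

# Tiny forward pass with rank 1
W = [[1, 0], [0, 1]]   # frozen 2 x 2 weight
A = [[1, 1]]           # 1 x 2
B = [[0], [2]]         # 2 x 1
print(lora_forward(W, A, B, [1, 2]))       # [1.0, 8.0]
```

In this toy setting the low-rank path trains about 2% of the weights that full fine-tuning would update, which is the kind of cost difference the comparison in the abstract is concerned with.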

large language model, machine learning, natural language, (21 more...)

arXiv: 2308.07282

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
Asia > China > Hong Kong (0.04)
(8 more...)

Genre: Research Report > New Finding (1.00)

Industry: Media > News (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.48)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

A Bi-directional Multi-hop Inference Model for Joint Dialog Sentiment Classification and Act Recognition

Zheng, Li, Li, Fei, Chai, Yuyang, Teng, Chong, Ji, Donghong

arXiv.org Artificial Intelligence, Aug-12-2023

The joint task of Dialog Sentiment Classification (DSC) and Act Recognition (DAR) aims to predict the sentiment label and act label for each utterance in a dialog simultaneously. However, current methods encode the dialog context in only one direction, which limits their ability to thoroughly comprehend the context. Moreover, these methods overlook the explicit correlations between sentiment and act labels, which leads to an insufficient ability to capture rich sentiment and act clues and hinders effective and accurate reasoning. To address these issues, we propose a Bi-directional Multi-hop Inference Model (BMIM) that leverages a feature selection network and a bi-directional multi-hop inference network to iteratively extract and integrate rich sentiment and act clues in a bi-directional manner. We also employ contrastive learning and dual learning to explicitly model the correlations of sentiment and act labels. Our experiments on two widely-used datasets show that BMIM outperforms state-of-the-art baselines by at least 2.6% on F1 score in DAR and 1.4% on F1 score in DSC. Additionally, our proposed model not only improves the performance but also enhances the interpretability of the joint sentiment and act prediction task.

correlation, machine learning, natural language, (14 more...)

arXiv: 2308.04424

Country: Asia > China > Hubei Province > Wuhan (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.87)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.87)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.62)

Weakly Supervised Text Classification on Free Text Comments in Patient-Reported Outcome Measures

Linton, Anna-Grace, Dimitrova, Vania, Downing, Amy, Wagland, Richard, Glaser, Adam

arXiv.org Artificial Intelligence, Aug-11-2023

Free text comments (FTC) in patient-reported outcome measures (PROMs) data are typically analysed using manual methods, such as content analysis, which is labour-intensive and time-consuming. Machine learning analysis methods are largely unsupervised, necessitating post-analysis interpretation. Weakly supervised text classification (WSTC) can be a valuable method of analysis to classify domain-specific text data in which there is limited labelled data. In this paper, we apply five WSTC techniques to FTC in PROMs data to identify health-related quality of life (HRQoL) themes reported by colorectal cancer patients. The WSTC methods label all the themes mentioned in the FTC. The results showed moderate performance on the PROMs data, mainly due to the precision of the models, and variation between themes. Evaluation of the classification performance illustrated the potential and limitations of keyword-based WSTC to label PROMs FTC when labelled data is limited.
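
As a rough picture of what keyword-based weak labelling can look like, the sketch below assigns every theme whose seed keywords appear in a comment, so one comment may receive several labels. It is not one of the five techniques evaluated in the paper; the theme names, seed keywords, and the function `keyword_label` are hypothetical.

```python
import re

def keyword_label(comment, theme_keywords):
    """Return, in sorted order, every theme with at least one keyword in the comment."""
    tokens = set(re.findall(r"[a-z]+", comment.lower()))
    return sorted(theme for theme, kws in theme_keywords.items() if tokens & kws)

# Hypothetical themes and seed keywords, invented for this example
themes = {
    "fatigue": {"tired", "exhausted", "fatigue"},
    "diet": {"food", "diet", "eating"},
}

print(keyword_label("I feel tired, and eating is difficult.", themes))  # ['diet', 'fatigue']
print(keyword_label("No complaints at all.", themes))                   # []
```

Exact keyword matching of this kind misses synonyms and ignores negation, which hints at why precision and per-theme variation can limit such methods.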

data mining, machine learning, natural language, (14 more...)

arXiv: 2308.06199

Country:

Europe > United Kingdom > England > West Yorkshire > Leeds (0.04)
Europe > United Kingdom > Wales (0.04)
Europe > United Kingdom > England > Hampshire > Southampton (0.04)
(7 more...)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine > Therapeutic Area > Oncology > Colorectal Cancer (0.35)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.87)

Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets

Karl, Fabian, Scherp, Ansgar

arXiv.org Artificial Intelligence, Aug-11-2023

Short text classification is a crucial and challenging aspect of Natural Language Processing. For this reason, there are numerous highly specialized short text classifiers. However, in recent short text research, State of the Art (SOTA) methods for traditional text classification, particularly the pure use of Transformers, have not been exploited. In this work, we examine the performance of a variety of short text classifiers as well as the top-performing traditional text classifier. We further investigate the effects on two new real-world short text datasets in an effort to address the issue of becoming overly dependent on benchmark datasets with a limited number of characteristics. Our experiments unambiguously demonstrate that Transformers achieve SOTA accuracy on short text classification tasks, raising the question of whether specialized short text techniques are necessary.

classification, machine learning, natural language, (19 more...)

arXiv: 2211.16878

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > New York > New York County > New York City (0.04)
(12 more...)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.95)
