AITopics | Text Classification

Collaborating Authors

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

News Overviews Instructional Materials AI-Alerts Classics

Towards Systematic Monolingual NLP Surveys: GenA of Greek NLP

Bakagianni, Juli, Pouli, Kanella, Gavriilidou, Maria, Pavlopoulos, John

arXiv.org Artificial IntelligenceJul-13-2024

Natural Language Processing (NLP) research has traditionally been predominantly focused on English, driven by the availability of resources, the size of the research community, and market demands. Recently, there has been a noticeable shift towards multilingualism in NLP, recognizing the need for inclusivity and effectiveness across diverse languages and cultures. Monolingual surveys have the potential to complement the broader trend towards multilingualism in NLP by providing foundational insights and resources necessary for effectively addressing the linguistic diversity of global communication. However, monolingual NLP surveys are extremely rare in literature. This study fills the gap by introducing a method for creating systematic and comprehensive monolingual NLP surveys. Characterized by a structured search protocol, it can be used to select publications and organize them through a taxonomy of NLP tasks. We include a classification of Language Resources (LRs), according to their availability, and datasets, according to their annotation, to highlight publicly-available and machine-actionable LRs. By applying our method, we conducted a systematic literature review of Greek NLP from 2012 to 2022, providing a comprehensive overview of the current state and challenges of Greek NLP research. We discuss the progress of Greek NLP and outline encountered Greek LRs, classified by availability and usability. As we show, our proposed method helps avoid common pitfalls, such as data leakage and contamination, and to assess language support per NLP task. We consider this systematic literature review of Greek NLP an application of our method that showcases the benefits of a monolingual NLP survey. Similar applications could be regard the myriads of languages whose progress in NLP lags behind that of well-supported languages.

author availability ann, modern greek, nlp application, (16 more...)

arXiv.org Artificial Intelligence

2407.09861

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Europe > Slovenia (0.04)
(23 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.67)

Industry:

Law (1.00)
Information Technology (1.00)
Education > Curriculum > Subject-Specific Education (0.92)
(7 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(8 more...)

Add feedback

Domain-Hierarchy Adaptation via Chain of Iterative Reasoning for Few-shot Hierarchical Text Classification

Ji, Ke, Wang, Peng, Ke, Wenjun, Li, Guozheng, Liu, Jiajun, Gao, Jingsheng, Shang, Ziyu

arXiv.org Artificial IntelligenceJul-11-2024

Recently, various pre-trained language models (PLMs) have been proposed to prove their impressive performances on a wide range of few-shot tasks. However, limited by the unstructured prior knowledge in PLMs, it is difficult to maintain consistent performance on complex structured scenarios, such as hierarchical text classification (HTC), especially when the downstream data is extremely scarce. The main challenge is how to transfer the unstructured semantic space in PLMs to the downstream domain hierarchy. Unlike previous work on HTC which directly performs multi-label classification or uses graph neural network (GNN) to inject label hierarchy, in this work, we study the HTC problem under a few-shot setting to adapt knowledge in PLMs from an unstructured manner to the downstream hierarchy. Technically, we design a simple yet effective method named Hierarchical Iterative Conditional Random Field (HierICRF) to search the most domain-challenging directions and exquisitely crafts domain-hierarchy adaptation as a hierarchical iterative language modeling problem, and then it encourages the model to make hierarchical consistency self-correction during the inference, thereby achieving knowledge transfer with hierarchical consistency preservation. We perform HierICRF on various architectures, and extensive experiments on two popular HTC datasets demonstrate that prompt with HierICRF significantly boosts the few-shot HTC performance with an average Micro-F1 by 28.80% to 1.50% and Macro-F1 by 36.29% to 1.5% over the previous state-of-the-art (SOTA) baselines under few-shot settings, while remaining SOTA hierarchical consistency performance.

classification, knowledge, text classification, (16 more...)

arXiv.org Artificial Intelligence

2407.08959

Country:

North America > United States (0.14)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.72)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.66)

Add feedback

NoisyAG-News: A Benchmark for Addressing Instance-Dependent Noise in Text Classification

Huang, Hongfei, Liang, Tingting, Sun, Xixi, Jin, Zikang, Yin, Yuyu

arXiv.org Artificial IntelligenceJul-9-2024

Existing research on learning with noisy labels predominantly focuses on synthetic label noise. Although synthetic noise possesses well-defined structural properties, it often fails to accurately replicate real-world noise patterns. In recent years, there has been a concerted effort to construct generalizable and controllable instance-dependent noise datasets for image classification, significantly advancing the development of noise-robust learning in this area. However, studies on noisy label learning for text classification remain scarce. To better understand label noise in real-world text classification settings, we constructed the benchmark dataset NoisyAG-News through manual annotation. Initially, we analyzed the annotated data to gather observations about real-world noise. We qualitatively and quantitatively demonstrated that real-world noisy labels adhere to instance-dependent patterns. Subsequently, we conducted comprehensive learning experiments on NoisyAG-News and its corresponding synthetic noise datasets using pre-trained language models and noise-handling techniques. Our findings reveal that while pre-trained models are resilient to synthetic noise, they struggle against instance-dependent noise, with samples of varying confusion levels showing inconsistent performance during training and testing. These real-world noise patterns pose new, significant challenges, prompting a reevaluation of noisy label handling methods. We hope that NoisyAG-News will facilitate the development and evaluation of future solutions for learning with noisy labels.

accuracy, dataset, noise, (14 more...)

arXiv.org Artificial Intelligence

2407.06579

Country:

Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

New Directions in Text Classification Research: Maximizing The Performance of Sentiment Classification from Limited Data

Agustian, Surya, Syah, Muhammad Irfan, Fatiara, Nurul, Abdillah, Rahmad

arXiv.org Artificial IntelligenceJul-8-2024

The stakeholders' needs in sentiment analysis for various issues, whether positive or negative, are speed and accuracy. One new challenge in sentiment analysis tasks is the limited training data, which often leads to suboptimal machine learning models and poor performance on test data. This paper discusses the problem of text classification based on limited training data (300 to 600 samples) into three classes: positive, negative, and neutral. A benchmark dataset is provided for training and testing data on the issue of Kaesang Pangarep's appointment as Chairman of PSI. External data for aggregation and augmentation purposes are provided, consisting of two datasets: the topic of Covid Vaccination sentiment and an open topic. The official score used is the F1-score, which balances precision and recall among the three classes, positive, negative, and neutral. A baseline score is provided as a reference for researchers for unoptimized classification methods. The optimized score is provided as a reference for the target score to be achieved by any proposed method. Both scoring (baseline and optimized) use the SVM method, which is widely reported as the state-of-the-art in conventional machine learning methods. The F1-scores achieved by the baseline and optimized methods are 40.83% and 51.28%, respectively.

anotator, positif, text classification research, (12 more...)

arXiv.org Artificial Intelligence

2407.05627

Country:

North America > United States (0.05)
Asia > Indonesia > Sumatra > Riau (0.04)

Genre: Research Report (0.70)

Industry:

Health & Medicine > Therapeutic Area > Immunology (0.88)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.91)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Add feedback

LegalTurk Optimized BERT for Multi-Label Text Classification and NER

Zeidi, Farnaz, Amasyali, Mehmet Fatih, Erol, Çiğdem

arXiv.org Artificial IntelligenceJun-30-2024

The introduction of the Transformer neural network, along with techniques like self-supervised pre-training and transfer learning, has paved the way for advanced models like BERT. Despite BERT's impressive performance, opportunities for further enhancement exist. To our knowledge, most efforts are focusing on improving BERT's performance in English and in general domains, with no study specifically addressing the legal Turkish domain. Our study is primarily dedicated to enhancing the BERT model within the legal Turkish domain through modifications in the pre-training phase. In this work, we introduce our innovative modified pre-training approach by combining diverse masking strategies. In the fine-tuning task, we focus on two essential downstream tasks in the legal domain: name entity recognition and multi-label text classification. To evaluate our modified pre-training approach, we fine-tuned all customized models alongside the original BERT models to compare their performance. Our modified approach demonstrated significant improvements in both NER and multi-label text classification tasks compared to the original BERT model. Finally, to showcase the impact of our proposed models, we trained our best models with different corpus sizes and compared them with BERTurk models. The experimental results demonstrate that our innovative approach, despite being pre-trained on a smaller corpus, competes with BERTurk.

corpus, multi-label text classification, text classification, (12 more...)

arXiv.org Artificial Intelligence

2407.00648

Country:

Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.05)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.05)
Asia > Middle East > Iran (0.04)
(3 more...)

Genre: Research Report > New Finding (0.86)

Industry:

Law (1.00)
Health & Medicine (1.00)
Education > Educational Setting > Higher Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

DiffuseDef: Improved Robustness to Adversarial Attacks

Li, Zhenhao, Rei, Marek, Specia, Lucia

arXiv.org Artificial IntelligenceJun-28-2024

Pretrained language models have significantly advanced performance across various natural language processing tasks. However, adversarial attacks continue to pose a critical challenge to system built using these models, as they can be exploited with carefully crafted adversarial texts. Inspired by the ability of diffusion models to predict and reduce noise in computer vision, we propose a novel and flexible adversarial defense method for language classification tasks, DiffuseDef, which incorporates a diffusion layer as a denoiser between the encoder and the classifier. During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively and finally ensembled to produce a robust text representation. By integrating adversarial training, denoising, and ensembling techniques, we show that DiffuseDef improves over different existing adversarial defense methods and achieves state-of-the-art performance against common adversarial attacks.

computational linguistic, diffusedef, proceedings, (13 more...)

arXiv.org Artificial Intelligence

2407.00248

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.05)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(6 more...)

Genre: Research Report (0.64)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Learn it or Leave it: Module Composition and Pruning for Continual Learning

Wang, Mingyang, Adel, Heike, Lange, Lukas, Strötgen, Jannik, Schütze, Hinrich

arXiv.org Artificial IntelligenceJun-26-2024

In real-world environments, continual learning is essential for machine learning models, as they need to acquire new knowledge incrementally without forgetting what they have already learned. While pretrained language models have shown impressive capabilities on various static tasks, applying them to continual learning poses significant challenges, including avoiding catastrophic forgetting, facilitating knowledge transfer, and maintaining parameter efficiency. In this paper, we introduce MoCL-P, a novel lightweight continual learning method that addresses these challenges simultaneously. Unlike traditional approaches that continuously expand parameters for newly arriving tasks, MoCL-P integrates task representation-guided module composition with adaptive pruning, effectively balancing knowledge integration and computational overhead. Our evaluation across three continual learning benchmarks with up to 176 tasks shows that MoCL-P achieves state-of-the-art performance and improves parameter efficiency by up to three times, demonstrating its potential for practical applications where resource requirements are constrained.

benchmark, continual learning, module, (12 more...)

arXiv.org Artificial Intelligence

2406.18708

Country:

North America > Canada > Ontario > Toronto (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
(7 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning (0.31)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.30)

Add feedback

AlleNoise -- large-scale text classification benchmark dataset with real-world label noise

Rączkowska, Alicja, Osowska-Kurczab, Aleksandra, Szczerbiński, Jacek, Jasinska-Kobus, Kalina, Nazarko, Klaudia

arXiv.org Artificial IntelligenceJun-24-2024

Label noise remains a challenge for training robust classification models. Most methods for mitigating label noise have been benchmarked using primarily datasets with synthetic noise. While the need for datasets with realistic noise distribution has partially been addressed by web-scraped benchmarks such as WebVision and Clothing1M, those benchmarks are restricted to the computer vision domain. With the growing importance of Transformer-based models, it is crucial to establish text classification benchmarks for learning with noisy labels. In this paper, we present AlleNoise, a new curated text classification benchmark dataset with real-world instance-dependent label noise, containing over 500,000 examples across approximately 5,600 classes, complemented with a meaningful, hierarchical taxonomy of categories. The noise distribution comes from actual users of a major e-commerce marketplace, so it realistically reflects the semantics of human mistakes. In addition to the noisy labels, we provide human-verified clean labels, which help to get a deeper insight into the noise distribution, unlike web-scraped datasets typically used in the field. We demonstrate that a representative selection of established methods for learning with noisy labels is inadequate to handle such real-world noise. In addition, we show evidence that these algorithms do not alleviate excessive memorization. As such, with AlleNoise, we set the bar high for the development of label noise methods that can handle real-world label noise in text classification tasks. The code and dataset are available for download at https://github.com/allegro/AlleNoise.

allenoise, large-scale text classification benchmark dataset, real-world label noise

arXiv.org Artificial Intelligence

2407.10992

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Depth $F_1$: Improving Evaluation of Cross-Domain Text Classification by Measuring Semantic Generalizability

Seegmiller, Parker, Gatto, Joseph, Preum, Sarah Masud

arXiv.org Artificial IntelligenceJun-20-2024

Recent evaluations of cross-domain text classification models aim to measure the ability of a model to obtain domain-invariant performance in a target domain given labeled samples in a source domain. The primary strategy for this evaluation relies on assumed differences between source domain samples and target domain samples in benchmark datasets. This evaluation strategy fails to account for the similarity between source and target domains, and may mask when models fail to transfer learning to specific target samples which are highly dissimilar from the source domain. We introduce Depth $F_1$, a novel cross-domain text classification performance metric. Designed to be complementary to existing classification metrics such as $F_1$, Depth $F_1$ measures how well a model performs on target samples which are dissimilar from the source domain. We motivate this metric using standard cross-domain text classification datasets and benchmark several recent cross-domain text classification models, with the goal of enabling in-depth evaluation of the semantic generalizability of cross-domain text classification models.

df 1, source domain, target domain, (14 more...)

arXiv.org Artificial Intelligence

2406.14695

Country:

North America > United States > Virginia (0.04)
North America > United States > New Hampshire > Grafton County > Hanover (0.04)
North America > United States > Nevada > Clark County > Las Vegas (0.04)
(3 more...)

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Fighting Randomness with Randomness: Mitigating Optimisation Instability of Fine-Tuning using Delayed Ensemble and Noisy Interpolation

Pecher, Branislav, Cegin, Jan, Belanec, Robert, Simko, Jakub, Srba, Ivan, Bielikova, Maria

arXiv.org Artificial IntelligenceJun-18-2024

While fine-tuning of pre-trained language models generally helps to overcome the lack of labelled training samples, it also displays model performance instability. This instability mainly originates from randomness in initialisation or data shuffling. To address this, researchers either modify the training process or augment the available samples, which typically results in increased computational costs. We propose a new mitigation strategy, called Delayed Ensemble with Noisy Interpolation (DENI), that leverages the strengths of ensembling, noise regularisation and model interpolation, while retaining computational efficiency. We compare DENI with 9 representative mitigation strategies across 3 models, 4 tuning strategies and 7 text classification datasets. We show that: 1) DENI outperforms the best performing mitigation strategy (Ensemble), while using only a fraction of its cost; 2) the mitigation strategies are beneficial for parameter-efficient fine-tuning (PEFT) methods, outperforming full fine-tuning in specific cases; and 3) combining DENI with data augmentation often leads to even more effective instability mitigation.

dataset, fine-tuning method, mitigation strategy, (14 more...)

arXiv.org Artificial Intelligence

2406.12471

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Singapore (0.04)
North America > Dominican Republic (0.04)
(8 more...)

Genre: Research Report > New Finding (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.66)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Add feedback