AITopics | Text Classification

Collaborating Authors

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

News Overviews Instructional Materials AI-Alerts Classics

NEU-ESC: A Comprehensive Vietnamese dataset for Educational Sentiment analysis and topic Classification toward multitask learning

Mai, Phan Quoc Hung, Nguyen, Quang Hung, Duong, Phuong Giang, Nguyen, Hong Hanh, Long, Nguyen Tuan

arXiv.org Artificial IntelligenceJul-1-2025

In the field of education, understanding students' opinions through their comments is crucial, especially in the Vietnamese language, where resources remain limited. Existing educational datasets often lack domain relevance and student slang. To address these gaps, we introduce NEU-ESC, a new Vietnamese dataset for Educational Sentiment Classification and Topic Classification, curated from university forums, which offers more samples, richer class diversity, longer texts, and broader vocabulary. In addition, we explore multitask learning using encoder-only language models (BERT), in which we showed that it achieves performance up to 83.7% and 79.8% accuracy for sentiment and topic classification tasks. We also benchmark our dataset and model with other datasets and models, including Large Language Models, and discuss these benchmarks. The dataset is publicly available at: https://huggingface.co/datasets/hung20gg/NEU-ESC.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2506.23524

Country: Asia (0.94)

Genre: Research Report (0.82)

Industry:

Education > Educational Setting > Online (0.46)
Education > Educational Technology > Educational Software (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
(2 more...)

Add feedback

Small Language Models in the Real World: Insights from Industrial Text Classification

Li, Lujun, Sleem, Lama, Gentile, Niccolo', Nichil, Geoffrey, State, Radu

arXiv.org Artificial IntelligenceJun-25-2025

With the emergence of ChatGPT, Transformer models have significantly advanced text classification and related tasks. Decoder-only models such as Llama exhibit strong performance and flexibility, yet they suffer from inefficiency on inference due to token-by-token generation, and their effectiveness in text classification tasks heavily depends on prompt quality. Moreover, their substantial GPU resource requirements often limit widespread adoption. Thus, the question of whether smaller language models are capable of effectively handling text classification tasks emerges as a topic of significant interest. However, the selection of appropriate models and methodologies remains largely underexplored. In this paper, we conduct a comprehensive evaluation of prompt engineering and supervised fine-tuning methods for transformer-based text classification. Specifically, we focus on practical industrial scenarios, including email classification, legal document categorization, and the classification of extremely long academic texts. We examine the strengths and limitations of smaller models, with particular attention to both their performance and their efficiency in Video Random-Access Memory (VRAM) utilization, thereby providing valuable insights for the local deployment and application of compact models in industrial settings.

classification, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2505.16078

Country: Europe (1.00)

Genre: Research Report > New Finding (0.94)

Industry:

Law (1.00)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Improving the Efficiency of Long Document Classification using Sentence Ranking Approach

Kokate, Prathamesh, Sarnaik, Mitali, Khopade, Manavi, Joshi, Raviraj

arXiv.org Artificial IntelligenceJun-24-2025

Long document classification poses challenges due to the computational limitations of transformer-based models, particularly BERT, which are constrained by fixed input lengths and quadratic attention complexity. Moreover, using the full document for classification is often redundant, as only a subset of sentences typically carries the necessary information. To address this, we propose a TF-IDF-based sentence ranking method that improves efficiency by selecting the most informative content. Our approach explores fixed-count and percentage-based sentence selection, along with an enhanced scoring strategy combining normalized TF-IDF scores and sentence length. Evaluated on the MahaNews LDC dataset of long Marathi news articles, the method consistently outperforms baselines such as first, last, and random sentence selection. With MahaBERT-v2, we achieve near-identical classification accuracy with just a 0.33 percent drop compared to the full-context baseline, while reducing input size by over 50 percent and inference latency by 43 percent. This demonstrates that significant context reduction is possible without sacrificing performance, making the method practical for real-world long document classification tasks.

classification, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2506.07248

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

ETS: Open Vocabulary Electroencephalography-To-Text Decoding and Sentiment Classification

Masry, Mohamed, Amen, Mohamed, Elzyat, Mohamed, Hamed, Mohamed, Magdy, Norhan, Khaled, Maram

arXiv.org Artificial IntelligenceJun-19-2025

Decoding natural language from brain activity using non-invasive electroencephalography (EEG) remains a significant challenge in neuroscience and machine learning, particularly for open-vocabulary scenarios where traditional methods struggle with noise and variability. Previous studies have achieved high accuracy on small-closed vocabularies, but it still struggles on open vocabularies. In this study, we propose ETS, a framework that integrates EEG with synchronized eye-tracking data to address two critical tasks: (1) open-vocabulary text generation and (2) sentiment classification of perceived language. Our model achieves a superior performance on BLEU and Rouge score for EEG-To-Text decoding and up to 10% F1 score on EEG-based ternary sentiment classification, which significantly outperforms supervised baselines. Furthermore, we show that our proposed model can handle data from various subjects and sources, showing great potential for high performance open vocabulary eeg-to-text system.

machine learning, natural language, text classification, (17 more...)

arXiv.org Artificial Intelligence

2506.14783

Country: North America > United States > California (0.28)

Genre: Research Report (0.85)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.69)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.61)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.84)

Add feedback

Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages

Almutairi, Ali, Alsuhaibani, Abdullah, Jameel, Shoaib, Naseem, Usman, Mohammadi, Gelareh, Razzak, Imran

arXiv.org Artificial IntelligenceJun-13-2025

Training deep learning networks with minimal supervision has gained significant research attention due to its potential to reduce reliance on extensive labelled data. While self-training methods have proven effective in semi-supervised learning, they remain vulnerable to errors from noisy pseudo labels. Moreover, most recent approaches to the few-label classification problem are either designed for resource-rich languages such as English or involve complex cascading models that are prone to overfitting. To address the persistent challenge of few-label text classification in truly low-resource linguistic contexts, where existing methods often struggle with noisy pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods that rely on generic multi-cluster pseudo-labelling or complex cascading architectures, Flick leverages the fundamental insight that distilling high-confidence pseudo-labels from a broader set of initial clusters can dramatically improve pseudo-label quality, particularly for linguistically diverse, low-resource settings. Flick introduces a novel pseudo-label refinement component, a departure from traditional pseudo-labelling strategies by identifying and leveraging top-performing pseudo-label clusters. This component specifically learns to distil highly reliable pseudo-labels from an initial broad set by focusing on single-cluster cohesion and leveraging an adaptive top-k selection mechanism. This targeted refinement process is crucial for mitigating the propagation of errors inherent in low-resource data, allowing for robust fine-tuning of pre-trained language models with only a handful of true labels. We demonstrate Flick's efficacy across 14 diverse datasets, encompassing challenging low-resource languages such as Arabic, Urdu, and Setswana, alongside English, showcasing its superior performance and adaptability.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.10292

Country:

Europe (0.46)
Oceania > Australia (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.86)

Add feedback

TokenBreak: Bypassing Text Classification Models Through Token Manipulation

Schulz, Kasimir, Yeung, Kenneth, Evans, Kieran

arXiv.org Artificial IntelligenceJun-10-2025

Natural Language Processing (NLP) models are used for text-related tasks such as classification and generation. To complete these tasks, input data is first tokenized from human-readable text into a format the model can understand, enabling it to make inferences and understand context. Text classification models can be implemented to guard against threats such as prompt injection attacks against Large Language Models (LLMs), toxic input and cybersecurity risks such as spam emails. In this paper, we introduce TokenBreak: a novel attack that can bypass these protection models by taking advantage of the tokenization strategy they use. This attack technique manipulates input text in such a way that certain models give an incorrect classification. Importantly, the end target (LLM or email recipient) can still understand and respond to the manipulated text and therefore be vulnerable to the very attack the protection model was put in place to prevent. The tokenizer is tied to model architecture, meaning it is possible to predict whether or not a model is vulnerable to attack based on family. We also present a defensive strategy as an added layer of protection that can be implemented without having to retrain the defensive model.

detection model, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2506.07948

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.61)

Add feedback

Quantum Graph Transformer for NLP Sentiment Classification

Aktar, Shamminuj, Bärtschi, Andreas, Badawy, Abdel-Hameed A., Eidenbenz, Stephan

arXiv.org Artificial IntelligenceJun-10-2025

Quantum machine learning is a promising direction for building more efficient and expressive models, particularly in domains where understanding complex, structured data is critical. We present the Quantum Graph Transformer (QGT), a hybrid graph-based architecture that integrates a quantum self-attention mechanism into the message-passing framework for structured language modeling. The attention mechanism is implemented using parameterized quantum circuits (PQCs), which enable the model to capture rich contextual relationships while significantly reducing the number of trainable parameters compared to classical attention mechanisms. We evaluate QGT on five sentiment classification benchmarks. Experimental results show that QGT consistently achieves higher or comparable accuracy than existing quantum natural language processing (QNLP) models, including both attention-based and non-attention-based approaches. When compared with an equivalent classical graph transformer, QGT yields an average accuracy improvement of 5.42% on real-world datasets and 4.76% on synthetic datasets. Additionally, QGT demonstrates improved sample efficiency, requiring nearly 50% fewer labeled samples to reach comparable performance on the Yelp dataset. These results highlight the potential of graph-based QNLP techniques for advancing efficient and scalable language understanding.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2506.07937

Country: North America > United States > New Mexico (0.29)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.87)
(5 more...)

Add feedback

Fine-grained Hierarchical Crop Type Classification from Integrated Hyperspectral EnMAP Data and Multispectral Sentinel-2 Time Series: A Large-scale Dataset and Dual-stream Transformer Method

Li, Wenyuan, Liang, Shunlin, Zhang, Yuxiang, Liu, Liqin, Chen, Keyan, Chen, Yongzhe, Ma, Han, Xu, Jianglei, Ma, Yichuan, Guan, Shikang, Shi, Zhenwei

arXiv.org Artificial IntelligenceJun-10-2025

Fine-grained crop type classification serves as the fundamental basis for large-scale crop mapping and plays a vital role in ensuring food security. It requires simultaneous capture of both phenological dynamics (obtained from multi-temporal satellite data like Sentinel-2) and subtle spectral variations (demanding nanometer-scale spectral resolution from hyperspectral imagery). Research combining these two modalities remains scarce currently due to challenges in hyperspectral data acquisition and crop types annotation costs. To address these issues, we construct a hierarchical hyperspectral crop dataset (H2Crop) by integrating 30m-resolution EnMAP hyperspectral data with Sentinel-2 time series. With over one million annotated field parcels organized in a four-tier crop taxonomy, H2Crop establishes a vital benchmark for fine-grained agricultural crop classification and hyperspectral image processing. We propose a dual-stream Transformer architecture that synergistically processes these modalities. It coordinates two specialized pathways: a spectral-spatial Transformer extracts fine-grained signatures from hyperspectral EnMAP data, while a temporal Swin Transformer extracts crop growth patterns from Sentinel-2 time series. The designed hierarchical classification head with hierarchical fusion then simultaneously delivers multi-level crop type classification across all taxonomic tiers. Experiments demonstrate that adding hyperspectral EnMAP data to Sentinel-2 time series yields a 4.2% average F1-scores improvement (peaking at 6.3%). Extensive comparisons also confirm our method's higher accuracy over existing deep learning approaches for crop type classification and the consistent benefits of hyperspectral data across varying temporal windows and crop change scenarios. Codes and dataset are available at https://github.com/flyakon/H2Crop.

machine learning, natural language, text classification, (20 more...)

arXiv.org Artificial Intelligence

2506.06155

Country:

Europe (0.46)
Asia > China (0.28)

Genre: Research Report (1.00)

Industry: Food & Agriculture > Agriculture (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Hierarchical Text Classification Using Contrastive Learning Informed Path Guided Hierarchy

Agrawal, Neeraj, Kumar, Saurabh, Bhatt, Priyanka, Agarwal, Tanishka

arXiv.org Artificial IntelligenceJun-6-2025

Hierarchical Text Classification (HTC) has recently gained traction given the ability to handle complex label hierarchy. This has found applications in domains like E- commerce, customer care and medicine industry among other real-world applications. Existing HTC models either encode label hierarchy separately and mix it with text encoding or guide the label hierarchy structure in the text encoder. Both approaches capture different characteristics of label hierarchy and are complementary to each other. In this paper, we propose a Hierarchical Text Classification using Contrastive Learning Informed Path guided hierarchy (HTC-CLIP), which learns hierarchy-aware text representation and text informed path guided hierarchy representation using contrastive learning. During the training of HTC-CLIP, we learn two different sets of class probabilities distributions and during inference, we use the pooled output of both probabilities for each class to get the best of both representations. Our results show that the two previous approaches can be effectively combined into one architecture to achieve improved performance. Tests on two public benchmark datasets showed an improvement of 0.99 - 2.37% in Macro F1 score using HTC-CLIP over the existing state-of-the-art models.

machine learning, natural language, text classification, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.3233/FAIA230249

2506.04381

Country:

Europe (0.93)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.54)

Industry: Information Technology (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Automatic classification of stop realisation with wav2vec2.0

Tanner, James, Sonderegger, Morgan, Stuart-Smith, Jane, Mielke, Jeff, Kendall, Tyler

arXiv.org Artificial IntelligenceJun-2-2025

Modern phonetic research regularly makes use of automatic tools for the annotation of speech data, however few tools exist for the annotation of many variable phonetic phenomena. At the same time, pre-trained self-supervised models, such as wav2vec2.0, have been shown to perform well at speech classification tasks and latently encode fine-grained phonetic information. We demonstrate that wav2vec2.0 models can be trained to automatically classify stop burst presence with high accuracy in both English and Japanese, robust across both finely-curated and unprepared speech corpora. Patterns of variability in stop realisation are replicated with the automatic annotations, and closely follow those of manual annotations. These results demonstrate the potential of pre-trained speech models as tools for the automatic annotation and processing of speech corpus data, enabling researchers to 'scale-up' the scope of phonetic research with relative ease.

annotation, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2505.23688

Country:

North America > United States (0.28)
Europe > Austria (0.28)
North America > Canada > Quebec > Montreal (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.40)

Add feedback