AITopics | Text Classification

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

Self-Regulated Data-Free Knowledge Amalgamation for Text Classification

Vijayaraghavan, Prashanth, Wang, Hongzhi, Shi, Luyao, Baldwin, Tyler, Beymer, David, Degan, Ehsan

arXiv.org Artificial Intelligence, Jun-16-2024

Recently, there has been a growing availability of pre-trained text models on various model repositories. These models greatly reduce the cost of training new models from scratch as they can be fine-tuned for specific tasks or trained on large datasets. However, these datasets may not be publicly accessible due to privacy, security, or intellectual property issues. In this paper, we aim to develop a lightweight student network that can learn from multiple teacher models without accessing their original training data. Hence, we investigate Data-Free Knowledge Amalgamation (DFKA), a knowledge-transfer task that combines insights from multiple pre-trained teacher models and transfers them effectively to a compact student network. To accomplish this, we propose STRATANET, a modeling framework comprising: (a) a steerable data generator that produces text data tailored to each teacher and (b) an amalgamation module that implements a self-regulative strategy using confidence estimates from the teachers' different layers to selectively integrate their knowledge and train a versatile student. We evaluate our method on three benchmark text classification datasets with varying labels or domains. Empirically, we demonstrate that the student model learned using our STRATANET outperforms several baselines significantly under data-driven and data-free constraints.

knowledge, student model, teacher model, (14 more...)

2406.15476

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > Santa Clara County > San Jose (0.05)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.93)

Industry:

Leisure & Entertainment (0.68)
Education (0.51)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.71)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)

Curating Stopwords in Marathi: A TF-IDF Approach for Improved Text Analysis and Information Retrieval

Chavan, Rohan, Patil, Gaurav, Madle, Vishal, Joshi, Raviraj

arXiv.org Artificial Intelligence, Jun-16-2024

Stopwords are commonly used words in a language that are often considered to be of little value in determining the meaning or significance of a document. These words occur frequently in most texts and don't provide much useful information for tasks like sentiment analysis and text classification. English, a high-resource language, benefits from readily available stopword lists, whereas for low-resource Indian languages like Marathi such lists are very limited: some standardized lists can be used through available packages, but the number of words in those packages is low. Our work targets the curation of stopwords in the Marathi language using the MahaCorpus, with 24.8 million sentences. We make use of the TF-IDF approach coupled with human evaluation to curate a strong stopword list of 400 words. We apply stopword removal to the text classification task and show its efficacy. The work also presents a simple recipe for stopword curation in a low-resource language. The stopwords are integrated into the mahaNLP library and publicly available at https://github.com/l3cube-pune/MarathiNLP.
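
The abstract above does not say how TF-IDF scores are turned into stopword candidates, so the following is only a minimal sketch under one plausible assumption: words with the lowest mean TF-IDF across the corpus are proposed as candidates for a human reviewer. The function name `stopword_candidates`, the whitespace tokenizer, and the toy English corpus are all hypothetical and are not taken from the paper or from the mahaNLP library.

```python
import math
from collections import Counter


def stopword_candidates(docs, top_k):
    """Return the top_k words with the lowest mean TF-IDF across docs."""
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    totals = Counter()
    for tokens in tokenized:
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            totals[word] += tf * idf
    # Average over all documents; ties are broken alphabetically.
    mean_score = {word: totals[word] / n_docs for word in doc_freq}
    return sorted(mean_score, key=lambda w: (mean_score[w], w))[:top_k]


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
]
candidates = stopword_candidates(docs, top_k=1)
print(candidates)
```

A word that occurs in every document has an inverse document frequency of zero, so it sorts to the front of the list; in a real pipeline the ranked candidates would still go through the human evaluation step the abstract mentions.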

application, stopword, stopword list, (16 more...)

doi: 10.1109/I2CT61223.2024.10544359

2406.11029

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > India > Maharashtra > Pune (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.90)

Universal Cross-Lingual Text Classification

Savant, Riya, Shelke, Anushka, Todmal, Sakshi, Kanphade, Sanskruti, Joshi, Ananya, Joshi, Raviraj

arXiv.org Artificial Intelligence, Jun-16-2024

Text classification, an integral task in natural language processing, involves the automatic categorization of text into predefined classes. Creating supervised labeled datasets for low-resource languages poses a considerable challenge. Unlocking the language potential of low-resource languages requires robust datasets with supervised labels. However, such datasets are scarce, and the label space is often limited. In our pursuit to address this gap, we aim to optimize existing labels/datasets in different languages. This research proposes a novel perspective on Universal Cross-Lingual Text Classification, leveraging a unified model across languages. Our approach involves blending supervised data from different languages during training to create a universal model. The supervised data for a target classification task might come from different languages covering different labels. The primary goal is to enhance label and language coverage, aiming for a label set that represents a union of labels from various languages. We propose the usage of a strong multilingual SBERT as our base model, making our novel training strategy feasible. This strategy contributes to the adaptability and effectiveness of the model in cross-lingual language transfer scenarios, where it can categorize text in languages not encountered during training. Thus, the paper delves into the intricacies of cross-lingual text classification, with a particular focus on its application for low-resource languages, exploring methodologies and implications for the development of a robust and adaptable universal cross-lingual model.

classification, cross-lingual text classification, text classification, (14 more...)

doi: 10.1109/I2CT61223.2024.10543381

2406.11028

Country:

Europe > Spain > Valencian Community > Valencia Province > Valencia (0.04)
Europe > Czechia > Prague (0.04)
Asia > India (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)

Robust Latent Representation Tuning for Image-text Classification

Sun, Hao, Song, Yu

arXiv.org Artificial Intelligence, Jun-14-2024

Large models have demonstrated exceptional generalization capabilities in computer vision and natural language processing. Recent efforts have focused on enhancing these models with multimodal processing abilities. However, addressing the challenges posed by scenarios where one modality is absent remains a significant hurdle. In response to this issue, we propose a robust latent representation tuning method for large models. Specifically, our approach introduces a modality latent translation module to maximize the correlation between modalities, resulting in a robust representation. Following this, a newly designed fusion module is employed to facilitate information interaction between the modalities. Within this framework, common semantics are refined during training, and robust performance is achieved even in the absence of one modality. Importantly, our method maintains the frozen state of the image and text foundation models to preserve their capabilities acquired through large-scale pretraining. We conduct experiments on several public datasets, and the results underscore the effectiveness of our proposed method.

module, representation, robust representation, (14 more...)

2406.06048

Country:

South America > Colombia > Meta Department > Villavicencio (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.51)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models

Schröder, Christopher, Heyer, Gerhard

arXiv.org Artificial Intelligence, Jun-13-2024

Active learning is an iterative labeling process that is used to obtain a small labeled subset, despite the absence of labeled data, thereby enabling the training of a model for supervised tasks such as text classification. While active learning has made considerable progress in recent years due to improvements provided by pre-trained language models, there is untapped potential in the often neglected unlabeled portion of the data, although it is available in considerably larger quantities than the usually small set of labeled data. Here we investigate how self-training, a semi-supervised approach where a model is used to obtain pseudo-labels from the unlabeled data, can be used to improve the efficiency of active learning for text classification. Starting with an extensive reproduction of four previous self-training approaches, some of which are evaluated for the first time in the context of active learning or natural language processing, we devise HAST, a new and effective self-training strategy, which is evaluated on four text classification benchmarks, on which it outperforms the reproduced self-training approaches and reaches classification results comparable to previous experiments for three out of four datasets, using only 25% of the data.
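
To make the idea of pseudo-labels concrete, here is a minimal sketch of generic confidence-thresholded self-training. It is not HAST, whose details the abstract does not give. The toy one-dimensional centroid classifier, the softmax-style confidence, the threshold of 0.9, and all function names are assumptions chosen only to keep the example self-contained.

```python
import math


def fit_centroids(points, labels):
    """Toy 1-D classifier: the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_confidence(centroids, x):
    """Softmax over negative distances; returns (label, confidence)."""
    weights = {y: math.exp(-abs(x - c)) for y, c in centroids.items()}
    total = sum(weights.values())
    label = max(weights, key=weights.get)
    return label, weights[label] / total


def self_train(points, labels, unlabeled, threshold=0.9, rounds=3):
    """Repeatedly add confident pseudo-labels to the training set."""
    points, labels, pool = list(points), list(labels), list(unlabeled)
    for _ in range(rounds):
        model = fit_centroids(points, labels)
        remaining = []
        for x in pool:
            label, confidence = predict_with_confidence(model, x)
            if confidence >= threshold:
                # Confident prediction: treat it as a (pseudo-)label.
                points.append(x)
                labels.append(label)
            else:
                remaining.append(x)
        if len(remaining) == len(pool):
            break  # nothing new was pseudo-labeled, so stop early
        pool = remaining
    return fit_centroids(points, labels), pool


model, leftover = self_train(
    points=[0.0, 10.0],
    labels=["neg", "pos"],
    unlabeled=[1.0, 9.0, 5.0],
)
print(model, leftover)
```

In this toy run the points 1.0 and 9.0 are pseudo-labeled with high confidence, while the ambiguous point 5.0 stays unlabeled; in an active learning loop such low-confidence points are natural candidates to send to a human annotator.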

active learning, classification, computational linguistic, (14 more...)

2406.09206

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Dominican Republic (0.04)
North America > Canada > Ontario > Toronto (0.04)
(17 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

RIFF: Learning to Rephrase Inputs for Few-shot Fine-tuning of Language Models

Najafi, Saeed, Fyshe, Alona

arXiv.org Artificial Intelligence, Jun-6-2024

Pre-trained Language Models (PLMs) can be accurately fine-tuned for downstream text processing tasks. Recently, researchers have introduced several parameter-efficient fine-tuning methods that optimize input prompts or adjust a small number of model parameters (e.g., LoRA). In this study, we explore the impact of altering the input text of the original task in conjunction with parameter-efficient fine-tuning methods. To most effectively rewrite the input text, we train a few-shot paraphrase model with a Maximum-Marginal Likelihood objective. Using six few-shot text classification datasets, we show that enriching data with paraphrases at train and test time enhances the performance beyond what can be achieved with parameter-efficient fine-tuning alone. The code used for our experiments can be found at https://github.com/SaeedNajafi/RIFF.

computational linguistic, language model, proceedings, (13 more...)

2403.02271

Country:

North America > United States > California (0.16)
North America > Canada > Alberta (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(14 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.66)

Industry: Leisure & Entertainment > Sports (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.93)
(2 more...)

Why is "Problems" Predictive of Positive Sentiment? A Case Study of Explaining Unintuitive Features in Sentiment Classification

Qu, Jiaming, Arguello, Jaime, Wang, Yue

arXiv.org Artificial Intelligence, Jun-5-2024

Explainable AI (XAI) algorithms aim to help users understand how a machine learning model makes predictions. To this end, many approaches explain which input features are most predictive of a target label. However, such explanations can still be puzzling to users (e.g., in product reviews, the word "problems" is predictive of positive sentiment). If left unexplained, puzzling explanations can have negative impacts. Explaining unintuitive associations between an input feature and a target label is an underexplored area in XAI research. We take an initial effort in this direction using unintuitive associations learned by sentiment classifiers as a case study. We propose approaches for (1) automatically detecting associations that can appear unintuitive to users and (2) generating explanations to help users understand why an unintuitive feature is predictive. Results from a crowdsourced study (N=300) found that our proposed approaches can effectively detect and explain predictive but unintuitive features in sentiment classification.

explanation, participant, sentiment, (16 more...)

doi: 10.1145/3630106.3658547

2406.03594

Country:

North America > United States > Illinois > Cook County > Chicago (0.05)
North America > United States > North Carolina (0.04)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
(2 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.95)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.70)

AI-based Classification of Customer Support Tickets: State of the Art and Implementation with AutoML

Truss, Mario, Boehm, Stephan

arXiv.org Artificial Intelligence, Jun-3-2024

One of the primary priorities of companies today is to improve the Customer Experience (CX) in order to increase customer satisfaction and reduce churn. However, "just 2 percent of organizations reached the top stage of CX maturity [and] most organizations are in early stages of CX maturity" (Dorsey et al., 2022). According to a recent study by Qualtrics (2022), 47 percent of customers ranked support as the second most important area of improvement in CX. One major factor of customer satisfaction identified in recent research (e.g., Service Excellence Research Group, 2021) is the speed at which customer support answers customer inquiries. Demand for customer support is rising and often exceeds the supply of available support agents. In particular, missing knowledge and multiple re-routings between support agents are major causes of delays in resolution time. Further research suggests that due to information overload, the quality of decisions decreases with the number of decisions (Hemp, 2009; Viegas et al., 2015). Recent studies name lack of time and resources as the main issues in customer support, which harm performance and, ultimately, the customer experience (HubSpot, 2022; Serrano et al., 2021).

classification, dataset, ticket, (16 more...)

2406.01789

Country:

Europe > Germany > Hesse > Darmstadt Region > Wiesbaden (0.04)
North America > United States > New York (0.04)
North America > United States > Hawaii (0.04)
(12 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Services (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification

Yun, Jungmin, Kim, Mihyeon, Kim, Youngbin

arXiv.org Artificial Intelligence, Jun-3-2024

Transformer-based models have achieved dominant performance in numerous NLP tasks. Despite their remarkable successes, pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism that interacts with all tokens, including the ones unfavorable to classification performance. To overcome these challenges, we propose integrating two strategies: token pruning and token combining. Token pruning eliminates less important tokens in the attention mechanism's key and value as they pass through the layers. Additionally, we adopt fuzzy logic to handle uncertainty and alleviate potential mispruning risks arising from an imbalanced distribution of each token's importance. Token combining, on the other hand, condenses input sequences into smaller sizes in order to further compress the model. By integrating these two approaches, we not only improve the model's performance but also reduce its computational demands. Experiments with various datasets demonstrate superior performance compared to baseline models, especially with the best improvement over the existing BERT model, achieving +5%p in accuracy and +5.6%p in F1 score. Additionally, memory cost is reduced to 0.61x, and a speedup of 1.64x is achieved.
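
As a rough illustration of token pruning, the sketch below keeps only the tokens that receive the most attention on average. This is an assumption made for illustration: the abstract does not state how token importance is measured, and the fuzzy-logic handling and token combining it describes are not modeled here. The function name `prune_tokens` and the toy attention matrix are hypothetical.

```python
def prune_tokens(attention, keep):
    """Return the indices of the `keep` most attended-to tokens.

    `attention[i][j]` is the weight that query token i places on key token j,
    with each row summing to 1.
    """
    n = len(attention)
    # Importance of token j: mean attention it receives over all queries.
    importance = [sum(row[j] for row in attention) / n for j in range(n)]
    top = sorted(range(n), key=lambda j: importance[j], reverse=True)[:keep]
    # Restore the original token order so positions stay meaningful.
    return sorted(top)


attention = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
]
kept = prune_tokens(attention, keep=2)
print(kept)
```

Dropping the least attended-to keys and values shrinks the attention computation in later layers, which is the general source of the memory and speed gains that pruning methods aim for.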

combination token, pruning, transformer, (17 more...)

doi: 10.18653/v1/2023.findings-emnlp.909

2406.01283

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.34)

FinEmbedDiff: A Cost-Effective Approach of Classifying Financial Documents with Vector Sampling using Multi-modal Embedding Models

Biswas, Anjanava, Talukdar, Wrick

arXiv.org Artificial Intelligence, May-28-2024

Accurate classification of multi-modal financial documents, containing text, tables, charts, and images, is crucial but challenging. Traditional text-based approaches often fail to capture the complex multi-modal nature of these documents. We propose FinEmbedDiff, a cost-effective vector sampling method that leverages pre-trained multi-modal embedding models to classify financial documents. Our approach generates multi-modal embedding vectors for documents, and compares new documents with pre-computed class embeddings using vector similarity measures. Evaluated on a large dataset, FinEmbedDiff achieves competitive classification accuracy compared to state-of-the-art baselines while significantly reducing computational costs. The method exhibits strong generalization capabilities, making it a practical and scalable solution for real-world financial applications.
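
The comparison of a new document against pre-computed class embeddings can be sketched as a nearest-class lookup under cosine similarity. Everything concrete here is an assumption: the abstract does not say how class embeddings are built (a per-class mean is used below), which similarity measure is used, or how vectors are sampled, and the three-dimensional vectors stand in for the output of a real multi-modal embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise mean; used here as a hypothetical class embedding."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(vector, class_vectors):
    """Pick the class whose embedding is most similar to the document."""
    return max(class_vectors, key=lambda label: cosine(vector, class_vectors[label]))


# Placeholder embeddings for a few labeled example documents per class.
examples = {
    "invoice": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "report": [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
}
# Computed once up front, then reused for every new document.
class_vectors = {label: mean_vector(vs) for label, vs in examples.items()}
predicted = classify([0.9, 0.1, 0.0], class_vectors)
print(predicted)
```

Because the class embeddings are computed once, classifying a new document costs one embedding call plus a handful of vector comparisons, which fits the cost-saving motivation stated in the abstract.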

classification, financial document, international research journal, (12 more...)

doi: 10.56726/IRJMETS57269

2406.01618

Country: Asia > India (0.04)

Genre: Research Report > Promising Solution (0.68)

Industry:

Banking & Finance (1.00)
Information Technology > Software (0.34)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.48)
(3 more...)
