AITopics | Text Classification

Collaborating Authors

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

News Overviews Instructional Materials AI-Alerts Classics

Stochastic Adversarial Networks for Multi-Domain Text Classification

Wang, Xu, Wu, Yuan

arXiv.org Artificial IntelligenceMay-27-2024

Adversarial training has been instrumental in advancing multi-domain text classification (MDTC). Traditionally, MDTC methods employ a shared-private paradigm, with a shared feature extractor for domain-invariant knowledge and individual private feature extractors for domain-specific knowledge. Despite achieving state-of-the-art results, these methods grapple with the escalating model parameters due to the continuous addition of new domains. To address this challenge, we introduce the Stochastic Adversarial Network (SAN), which innovatively models the parameters of the domain-specific feature extractor as a multivariate Gaussian distribution, as opposed to a traditional weight vector. This design allows for the generation of numerous domain-specific feature extractors without a substantial increase in model parameters, maintaining the model's size on par with that of a single domain-specific extractor. Furthermore, our approach integrates domain label smoothing and robust pseudo-label regularization to fortify the stability of adversarial training and to refine feature discriminability, respectively. The performance of our SAN, evaluated on two leading MDTC benchmarks, demonstrates its competitive edge against the current state-of-the-art methodologies. The code is available at https://github.com/wangxu0820/SAN.

classification, extractor, feature extractor, (11 more...)

arXiv.org Artificial Intelligence

2406.00044

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

ChronosLex: Time-aware Incremental Training for Temporal Generalization of Legal Classification Tasks

Santosh, T. Y. S. S, Vuong, Tuan-Quang, Grabmair, Matthias

arXiv.org Artificial IntelligenceMay-23-2024

This study investigates the challenges posed by the dynamic nature of legal multi-label text classification tasks, where legal concepts evolve over time. Existing models often overlook the temporal dimension in their training process, leading to suboptimal performance of those models over time, as they treat training data as a single homogeneous block. To address this, we introduce ChronosLex, an incremental training paradigm that trains models on chronological splits, preserving the temporal order of the data. However, this incremental approach raises concerns about overfitting to recent data, prompting an assessment of mitigation strategies using continual learning and temporal invariant methods. Our experimental results over six legal multi-label text classification datasets reveal that continual learning methods prove effective in preventing overfitting thereby enhancing temporal generalizability, while temporal invariant methods struggle to capture these dynamics of temporal shifts.

computational linguistic, dataset, proceedings, (16 more...)

arXiv.org Artificial Intelligence

2405.14211

Country:

North America > United States (0.46)
Europe > United Kingdom (0.14)
Europe > Austria > Vienna (0.14)
(6 more...)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Government > Regional Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.87)

Add feedback

Simple-Sampling and Hard-Mixup with Prototypes to Rebalance Contrastive Learning for Text Classification

Li, Mengyu, Liu, Yonghao, Giunchiglia, Fausto, Feng, Xiaoyue, Guan, Renchu

arXiv.org Artificial IntelligenceMay-19-2024

Text classification is a crucial and fundamental task in natural language processing. Compared with the previous learning paradigm of pre-training and fine-tuning by cross entropy loss, the recently proposed supervised contrastive learning approach has received tremendous attention due to its powerful feature learning capability and robustness. Although several studies have incorporated this technique for text classification, some limitations remain. First, many text datasets are imbalanced, and the learning mechanism of supervised contrastive learning is sensitive to data imbalance, which may harm the model performance. Moreover, these models leverage separate classification branch with cross entropy and supervised contrastive learning branch without explicit mutual guidance. To this end, we propose a novel model named SharpReCL for imbalanced text classification tasks. First, we obtain the prototype vector of each class in the balanced classification branch to act as a representation of each class. Then, by further explicitly leveraging the prototype vectors, we construct a proper and sufficient target sample set with the same size for each class to perform the supervised contrastive learning procedure. The empirical results show the effectiveness of our model, which even outperforms popular large language models across several datasets.

classification, dataset, prototype vector, (15 more...)

arXiv.org Artificial Intelligence

2405.11524

Genre:

Research Report > Promising Solution (0.34)
Research Report > New Finding (0.34)

Industry: Health & Medicine > Therapeutic Area (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Fine-tuning the SwissBERT Encoder Model for Embedding Sentences and Documents

Grosjean, Juri, Vamvas, Jannis

arXiv.org Artificial IntelligenceMay-13-2024

Encoder models trained for the embedding of sentences or short documents have proven useful for tasks such as semantic search and topic modeling. In this paper, we present a version of the SwissBERT encoder model that we specifically fine-tuned for this purpose. SwissBERT contains language adapters for the four national languages of Switzerland -- German, French, Italian, and Romansh -- and has been pre-trained on a large number of news articles in those languages. Using contrastive learning based on a subset of these articles, we trained a fine-tuned version, which we call SentenceSwissBERT. Multilingual experiments on document retrieval and text classification in a Switzerland-specific setting show that SentenceSwissBERT surpasses the accuracy of the original SwissBERT model and of a comparable baseline. The model is openly available for research use.

dataset, proceedings, swissbert, (15 more...)

arXiv.org Artificial Intelligence

2405.07513

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Switzerland > Neuchâtel > Neuchâtel (0.04)
(7 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.35)

Add feedback

Improving Long Text Understanding with Knowledge Distilled from Summarization Model

Liu, Yan, Yang, Yazheng, Chen, Xiaokang

arXiv.org Artificial IntelligenceMay-8-2024

Long text understanding is important yet challenging for natural language processing. A long article or document usually contains many redundant words that are not pertinent to its gist and sometimes can be regarded as noise. With recent advances of abstractive summarization, we propose our \emph{Gist Detector} to leverage the gist detection ability of a summarization model and integrate the extracted gist into downstream models to enhance their long text understanding ability. Specifically, Gist Detector first learns the gist detection knowledge distilled from a summarization model, and then produces gist-aware representations to augment downstream models. We evaluate our method on three different tasks: long document classification, distantly supervised open-domain question answering, and non-parallel text style transfer. The experimental results show that our method can significantly improve the performance of baseline models on all tasks.

gist detector, information, summarization model, (13 more...)

arXiv.org Artificial Intelligence

2405.04955

Country:

Asia > China > Hong Kong (0.04)
South America > Brazil (0.04)
Europe > Spain (0.04)
(2 more...)

Genre: Research Report (0.70)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.35)

Add feedback

The Trade-off between Performance, Efficiency, and Fairness in Adapter Modules for Text Classification

Bui, Minh Duc, von der Wense, Katharina

arXiv.org Artificial IntelligenceMay-3-2024

Current natural language processing (NLP) research tends to focus on only one or, less frequently, two dimensions - e.g., performance, privacy, fairness, or efficiency - at a time, which may lead to suboptimal conclusions and often overlooking the broader goal of achieving trustworthy NLP. Work on adapter modules (Houlsby et al., 2019; Hu et al., 2021) focuses on improving performance and efficiency, with no investigation of unintended consequences on other aspects such as fairness. To address this gap, we conduct experiments on three text classification datasets by either (1) finetuning all parameters or (2) using adapter modules. Regarding performance and efficiency, we confirm prior findings that the accuracy of adapter-enhanced models is roughly on par with that of fully finetuned models, while training time is substantially reduced. Regarding fairness, we show that adapter modules result in mixed fairness across sensitive groups. Further investigation reveals that, when the standard fine-tuned model exhibits limited biases, adapter modules typically do not introduce extra bias. On the other hand, when the finetuned model exhibits increased bias, the impact of adapter modules on bias becomes more unpredictable, introducing the risk of significantly magnifying these biases for certain groups. Our findings highlight the need for a case-by-case evaluation rather than a one-size-fits-all judgment.

computational linguistic, dataset, fairness, (14 more...)

arXiv.org Artificial Intelligence

2405.0201

Country:

North America > United States > Washington > King County > Seattle (0.14)
Europe > Germany > Rheinland-Pfalz > Mainz (0.05)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(8 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Media (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.61)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)

Add feedback

UniGen: Universal Domain Generalization for Sentiment Classification via Zero-shot Dataset Generation

Choi, Juhwan, Kim, Yeonghwa, Yu, Seunguk, Yun, JungMin, Kim, YoungBin

arXiv.org Artificial IntelligenceMay-2-2024

Although pre-trained language models have exhibited great flexibility and versatility with prompt-based few-shot learning, they suffer from the extensive parameter size and limited applicability for inference. Recent studies have suggested that PLMs be used as dataset generators and a tiny task-specific model be trained to achieve efficient inference. However, their applicability to various domains is limited because they tend to generate domain-specific datasets. In this work, we propose a novel approach to universal domain generalization that generates a dataset regardless of the target domain. This allows for generalization of the tiny task model to any domain that shares the label space, thus enhancing the real-world applicability of the dataset generation paradigm. Our experiments indicate that the proposed method accomplishes generalizability across various domains while using a parameter set that is orders of magnitude smaller than PLMs.

dataset, proceedings, tam, (15 more...)

arXiv.org Artificial Intelligence

2405.01022

Country: Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment (0.68)
Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.64)
(2 more...)

Add feedback

AutoGluon-Multimodal (AutoMM): Supercharging Multimodal AutoML with Foundation Models

Tang, Zhiqiang, Fang, Haoyang, Zhou, Su, Yang, Taojiannan, Zhong, Zihan, Hu, Tony, Kirchhoff, Katrin, Karypis, George

arXiv.org Artificial IntelligenceApr-30-2024

AutoGluon-Multimodal (AutoMM) is introduced as an open-source AutoML library designed specifically for multimodal learning. Distinguished by its exceptional ease of use, AutoMM enables fine-tuning of foundation models with just three lines of code. Supporting various modalities including image, text, and tabular data, both independently and in combination, the library offers a comprehensive suite of functionalities spanning classification, regression, object detection, semantic matching, and image segmentation. Experiments across diverse datasets and tasks showcases AutoMM's superior performance in basic classification and regression tasks compared to existing AutoML tools, while also demonstrating competitive results in advanced tasks, aligning with specialized toolboxes designed for such purposes.

arxiv preprint arxiv, automm, dataset, (13 more...)

arXiv.org Artificial Intelligence

2404.16233

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > California (0.04)
(10 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Banking & Finance (0.67)
Health & Medicine > Diagnostic Medicine > Imaging (0.46)
Information Technology > Services (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
(6 more...)

Add feedback

Lightweight Conceptual Dictionary Learning for Text Classification Using Information Compression

Wan, Li, Alpcan, Tansu, Kuijper, Margreta, Viterbo, Emanuele

arXiv.org Artificial IntelligenceApr-28-2024

We propose a novel, lightweight supervised dictionary learning framework for text classification based on data compression and representation. This two-phase algorithm initially employs the Lempel-Ziv-Welch (LZW) algorithm to construct a dictionary from text datasets, focusing on the conceptual significance of dictionary elements. Subsequently, dictionaries are refined considering label data, optimizing dictionary atoms to enhance discriminative power based on mutual information and class distribution. This process generates discriminative numerical representations, facilitating the training of simple classifiers such as SVMs and neural networks. We evaluate our algorithm's information-theoretic performance using information bottleneck principles and introduce the information plane area rank (IPAR) as a novel metric to quantify the information-theoretic performance. Tested on six benchmark text datasets, our algorithm competes closely with top models, especially in limited-vocabulary contexts, using significantly fewer parameters. \review{Our algorithm closely matches top-performing models, deviating by only ~2\% on limited-vocabulary datasets, using just 10\% of their parameters. However, it falls short on diverse-vocabulary datasets, likely due to the LZW algorithm's constraints with low-repetition data. This contrast highlights its efficiency and limitations across different dataset types.

algorithm, classification, dataset, (16 more...)

arXiv.org Artificial Intelligence

2405.01584

Country:

Oceania > Australia (0.14)
North America > United States > New York (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

L3Cube-MahaNews: News-based Short Text and Long Document Classification Datasets in Marathi

Mittal, Saloni, Magdum, Vidula, Dhekane, Omkar, Hiwarkhedkar, Sharayu, Joshi, Raviraj

arXiv.org Artificial IntelligenceApr-28-2024

The availability of text or topic classification datasets in the low-resource Marathi language is limited, typically consisting of fewer than 4 target labels, with some achieving nearly perfect accuracy. In this work, we introduce L3Cube-MahaNews, a Marathi text classification corpus that focuses on News headlines and articles. This corpus stands out as the largest supervised Marathi Corpus, containing over 1.05L records classified into a diverse range of 12 categories. To accommodate different document lengths, MahaNews comprises three supervised datasets specifically designed for short text, long documents, and medium paragraphs. The consistent labeling across these datasets facilitates document length-based analysis. We provide detailed data statistics and baseline results on these datasets using state-of-the-art pre-trained BERT models. We conduct a comparative analysis between monolingual and multilingual BERT models, including MahaBERT, IndicBERT, and MuRIL. The monolingual MahaBERT model outperforms all others on every dataset. These resources also serve as Marathi topic classification datasets or models and are publicly available at https://github.com/l3cube-pune/MarathiNLP .

classification, dataset, text classification, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-3-031-58495-4_4

2404.18216

Country:

Asia > Middle East > UAE > Dubai Emirate > Dubai (0.04)
Asia > India > Tamil Nadu > Chennai (0.04)
Asia > India > Maharashtra > Pune (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback