AITopics | Text Classification

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.
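As a concrete illustration of the spam-filtering use case in the quote, the following is a minimal sketch of a naive Bayes text classifier. It is not code from the cited article; the function names, whitespace tokenization, add-one smoothing, and toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (text, label) pairs. Returns model statistics."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the label with the highest posterior log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior plus summed log likelihoods with add-one (Laplace) smoothing.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up training data for illustration only.
toy_docs = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]
model = train_naive_bayes(toy_docs)
print(classify("win a cash prize", model))
```

A real filter would need far more training data and a better tokenizer; the sketch only shows how word counts per class turn into a label.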

Classical Out-of-Distribution Detection Methods Benchmark in Text Classification Tasks

Baran, Mateusz, Baran, Joanna, Wójcik, Mateusz, Zięba, Maciej, Gonczarek, Adam

arXiv.org Artificial Intelligence, Jul-13-2023

State-of-the-art models can perform well in controlled environments, but they often struggle when presented with out-of-distribution (OOD) examples, making OOD detection a critical component of NLP systems. In this paper, we focus on highlighting the limitations of existing approaches to OOD detection in NLP. Specifically, we evaluated eight OOD detection methods that are easily integrable into existing NLP systems and require no additional OOD data or model modifications. One of our contributions is providing a well-structured research environment that allows for full reproducibility of the results. Additionally, our analysis shows that existing OOD detection methods for NLP tasks are not yet sufficiently sensitive to capture all samples characterized by various types of distributional shifts. Particularly challenging testing scenarios arise in cases of background shift and randomly shuffled word order within in-domain texts. This highlights the need for future work to develop more effective OOD detection approaches for NLP problems, and our work provides a well-defined foundation for further research in this area.
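The abstract does not name the eight methods. As an illustration of the kind of detector that needs no extra OOD data or model changes, here is a minimal sketch of maximum softmax probability (MSP) scoring, a widely used baseline; whether it is among the evaluated methods is not stated above. The function names, the threshold value, and the example logits are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_ood_score(logits):
    """Higher score means the input looks more out-of-distribution."""
    return 1.0 - max(softmax(logits))

def is_ood(logits, threshold=0.5):
    # The threshold is a hypothetical value; in practice it would be tuned
    # on held-out in-distribution data.
    return msp_ood_score(logits) > threshold

# Made-up classifier outputs for two inputs.
confident_logits = [6.0, 0.5, 0.2]   # one class clearly dominates
uncertain_logits = [1.0, 0.9, 1.1]   # nearly uniform prediction

print(is_ood(confident_logits), is_ood(uncertain_logits))
```

A detector like this only reacts to low classifier confidence, which is one plausible reason shifts such as shuffled word order can go unnoticed when the classifier stays confident.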

machine learning, natural language, text classification, (16 more...)

arXiv.org Artificial Intelligence

2307.07002

Country:

Europe > Poland > Lower Silesia Province > Wroclaw (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Oregon (0.04)
(3 more...)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.83)
Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

Better Handling Coreference Resolution in Aspect Level Sentiment Classification by Fine-Tuning Language Models

Mullick, Dhruv, Ghanem, Bilal, Fyshe, Alona

arXiv.org Artificial Intelligence, Jul-11-2023

Customer feedback is invaluable to companies as they refine their products. Monitoring customer feedback can be automated with Aspect Level Sentiment Classification (ALSC), which allows us to analyse specific aspects of the products in reviews. Large Language Models (LLMs) are at the heart of many state-of-the-art ALSC solutions, but they perform poorly in some scenarios requiring Coreference Resolution (CR). In this work, we propose a framework to improve an LLM's performance on CR-containing reviews by fine-tuning on highly inferential tasks. We show that the performance improvement is likely attributable to the model's improved CR ability. We also release a new dataset that focuses on CR in ALSC.

computational linguistic, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2307.05646

Country:

North America > Canada > Alberta (0.15)
Asia > China > Hong Kong (0.05)
Europe > Italy > Tuscany > Florence (0.04)
(9 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.87)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.87)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.78)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.61)

Cross-Lingual Retrieval Augmented Prompt for Low-Resource Languages

Nie, Ercong, Liang, Sheng, Schmid, Helmut, Schütze, Hinrich

arXiv.org Artificial Intelligence, Jul-10-2023

Multilingual Pretrained Language Models (MPLMs) have shown their strong multilinguality in recent empirical cross-lingual transfer studies. In this paper, we propose the Prompts Augmented by Retrieval Crosslingually (PARC) pipeline to improve the zero-shot performance on low-resource languages (LRLs) by augmenting the context with semantically similar sentences retrieved from a high-resource language (HRL) as prompts. PARC improves the zero-shot performance on three downstream tasks (binary sentiment classification, topic categorization and natural language inference) with multilingual parallel test sets across 10 LRLs covering 6 language families in both unlabeled settings (+5.1%) and labeled settings (+16.3%). PARC-labeled also outperforms the finetuning baseline by 3.7%. We find a significant positive correlation between cross-lingual transfer performance on one side, and the similarity between the high- and low-resource languages as well as the amount of low-resource pretraining data on the other side. A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs.
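The retrieve-then-prompt idea can be sketched as follows. This is a simplified illustration, not the PARC implementation: the two-dimensional vectors stand in for multilingual sentence embeddings, and the prompt template, the `[MASK]` placeholder, the function names, and the example sentences are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, hrl_pool, k=1):
    """Return the k high-resource sentences most similar to the query."""
    ranked = sorted(hrl_pool, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(lrl_text, retrieved, template="{text} It was {label}."):
    """Prepend retrieved sentences to the low-resource input.

    If a retrieved item carries a label, it is verbalized with the template
    (labeled setting); otherwise only its text is used (unlabeled setting).
    """
    parts = []
    for item in retrieved:
        if item.get("label"):
            parts.append(template.format(text=item["text"], label=item["label"]))
        else:
            parts.append(item["text"])
    parts.append(template.format(text=lrl_text, label="[MASK]"))
    return " ".join(parts)

# Toy pool with made-up embeddings and labels.
hrl_pool = [
    {"text": "The movie was wonderful.", "vec": [0.9, 0.1], "label": "great"},
    {"text": "The food was awful.", "vec": [0.1, 0.9], "label": "terrible"},
]
query_text = "Filamu ilikuwa nzuri sana."  # low-resource-language input
query_vec = [0.8, 0.2]                     # hypothetical embedding of the input

prompt = build_prompt(query_text, retrieve(query_vec, hrl_pool, k=1))
print(prompt)
```

The resulting prompt would then be passed to a masked multilingual model, which fills the `[MASK]` slot; that step is omitted here.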

computational linguistic, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2212.09651

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > Dominican Republic (0.04)
(8 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.87)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

S2vNTM: Semi-supervised vMF Neural Topic Modeling

Xu, Weijie, Desai, Jay, Sengamedu, Srinivasan, Jiang, Xiaoyu, Iannacci, Francis

arXiv.org Artificial Intelligence, Jul-6-2023

Language model based methods are powerful techniques for text classification. However, the models have several shortcomings. In this paper, we propose Semi-Supervised vMF Neural Topic Modeling (S2vNTM) to overcome these difficulties. S2vNTM takes a few seed keywords as input for topics. S2vNTM leverages the pattern of keywords to identify potential topics, as well as to optimize the quality of the topics' keyword sets. Across a variety of datasets, S2vNTM outperforms existing semi-supervised topic modeling methods in classification accuracy with limited keywords provided. S2vNTM is at least twice as fast as baselines.

Language Model (LM) pre-training (Vaswani et al., 2017; Devlin et al., 2018) has proven to be useful in learning universal language representations. Recent language models such as Yang et al. (2019); Sun et al. (2019); Chen et al. (2022); Ding et al. (2021) have achieved strong results in text classification. Most of these methods need enough high-quality labels to train. To make LM based methods work well when limited labels are available, few-shot learning methods such as Bianchi et al. (2021); Meng et al. (2020a;b); Mekala and Shang (2020); Yu et al. (2021); Wang et al. (2021b) have been proposed. However, these methods rely on large pre-trained texts and can be biased when applied to a different environment. Topic modeling methods generate topics based on the pattern of words.

machine learning, natural language, text classification, (17 more...)

arXiv.org Artificial Intelligence

2307.04804

Country:

Asia > Middle East > Jordan (0.04)
Asia > Middle East > Iraq (0.04)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.87)
(2 more...)

Monolingual and Cross-Lingual Knowledge Transfer for Topic Classification

Karpov, Dmitry, Burtsev, Mikhail

arXiv.org Artificial Intelligence, Jul-4-2023

This article investigates knowledge transfer from the RuQTopics dataset. This Russian topical dataset combines a large number of samples (361,560 single-label, 170,930 multi-label) with extensive class coverage (76 classes). We have prepared this dataset from the "Yandex Que" raw data. By evaluating the RuQTopics-trained models on the six matching classes of the Russian MASSIVE subset, we show that the RuQTopics dataset is suitable for real-world conversational tasks, as the Russian-only models trained on this dataset consistently yield an accuracy of around 85% on this subset. We also found that for multilingual BERT, trained on RuQTopics and evaluated on the same six classes of MASSIVE (for all MASSIVE languages), the language-wise accuracy closely correlates (Spearman correlation 0.773 with p-value 2.997e-11) with the approximate size of BERT's pretraining data for the corresponding language. At the same time, the correlation of the language-wise accuracy with the linguistic distance from Russian is not statistically significant.
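For readers unfamiliar with the statistic reported above, the following sketch shows how a Spearman rank correlation between per-language pretraining data size and accuracy can be computed. The five data points are made up for illustration and are not values from the paper; the function names are likewise assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-language values (not from the paper).
pretraining_size = [10, 200, 35, 500, 80]
accuracy = [0.52, 0.81, 0.66, 0.79, 0.70]

print(round(spearman(pretraining_size, accuracy), 3))
```

Because only ranks matter, the statistic captures any monotonic relationship between data size and accuracy, not just a linear one. The p-value computation is omitted from this sketch.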

dataset, knowledge transfer, topic classification, (14 more...)

arXiv.org Artificial Intelligence

2306.07797

Country:

Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > California > Los Angeles County > Los Angeles (0.04)
(3 more...)

Genre: Research Report > Experimental Study > Negative Result (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.66)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

KDSTM: Neural Semi-supervised Topic Modeling with Knowledge Distillation

Xu, Weijie, Jiang, Xiaoyu, Desai, Jay, Han, Bin, Yan, Fuqin, Iannacci, Francis

arXiv.org Artificial Intelligence, Jul-4-2023

In text classification tasks, fine-tuning pretrained language models like BERT and GPT-3 yields competitive accuracy; however, both methods require pretraining on large text datasets. In contrast, general topic modeling methods possess the advantage of analyzing documents to extract meaningful patterns of words without the need for pretraining. To leverage topic modeling's unsupervised insight extraction for text classification tasks, we develop the Knowledge Distillation Semi-supervised Topic Modeling (KDSTM). KDSTM requires no pretrained embeddings and few labeled documents, and is efficient to train, making it ideal under resource-constrained settings. Across a variety of datasets, our method outperforms existing supervised topic modeling methods in classification accuracy, robustness and efficiency, and achieves similar performance compared to state-of-the-art weakly supervised text classification methods.

machine learning, natural language, topic modeling, (17 more...)

arXiv.org Artificial Intelligence

2307.01878

Country:

Asia > Middle East > Jordan (0.05)
North America > United States (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

SCAT: Robust Self-supervised Contrastive Learning via Adversarial Training for Text Classification

Wu, Junjie, Yeung, Dit-Yan

arXiv.org Artificial Intelligence, Jul-4-2023

Despite their promising performance across various natural language processing (NLP) tasks, current NLP systems are vulnerable to textual adversarial attacks. To defend against these attacks, most existing methods apply adversarial training by incorporating adversarial examples. However, these methods have to rely on ground-truth labels to generate adversarial examples, rendering them impractical for large-scale model pre-training, which is commonly used nowadays for NLP and many other tasks. In this paper, we propose a novel learning framework called SCAT (Self-supervised Contrastive Learning via Adversarial Training), which can learn robust representations without requiring labeled data. Specifically, SCAT modifies random augmentations of the data in a fully label-free manner to generate adversarial examples. Adversarial training is achieved by minimizing the contrastive loss between the augmentations and their adversarial counterparts. We evaluate SCAT on two text classification datasets using two state-of-the-art attack schemes proposed recently. Our results show that SCAT can not only train robust language models from scratch, but it can also significantly improve the robustness of existing pre-trained language models. Moreover, to demonstrate its flexibility, we show that SCAT can also be combined with supervised adversarial training to further enhance model robustness.
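The abstract does not specify the exact contrastive loss. As a rough illustration of the objective it describes, the sketch below computes an InfoNCE-style loss that pulls each augmentation embedding toward its adversarial counterpart and away from the other items in the batch. The choice of InfoNCE, the temperature value, the function names, and the toy embeddings are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def info_nce(anchors, positives, temperature=0.5):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive for anchors[i]; the other positives in the
    batch serve as negatives for that anchor.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Made-up embeddings: augmentations and their (hypothetical) adversarial versions.
augmented = [[1.0, 0.0], [0.0, 1.0]]
adversarial_close = [[0.9, 0.1], [0.1, 0.9]]  # representations stayed close
adversarial_far = [[0.1, 0.9], [0.9, 0.1]]    # the attack moved them apart

print(info_nce(augmented, adversarial_close), info_nce(augmented, adversarial_far))
```

The loss is small when each adversarial example stays close to its source augmentation in embedding space, which is the behaviour that training on this objective would encourage.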

computational linguistic, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2307.01488

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Asia > China > Hong Kong (0.04)
(13 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

PromptBoosting: Black-Box Text Classification with Ten Forward Passes

Hou, Bairu, O'Connor, Joe, Andreas, Jacob, Chang, Shiyu, Zhang, Yang

arXiv.org Artificial Intelligence, Jul-2-2023

We describe PromptBoosting, a query-efficient procedure for building a text classifier from a neural language model (LM) without access to the LM's parameters, gradients, or hidden representations. This form of "black-box" classifier training has become increasingly important as the cost of training and inference in large-scale LMs grows. But existing black-box LM classifier learning approaches are themselves computationally inefficient, typically specializing LMs to the target task by searching in a large space of (discrete or continuous) prompts using zeroth-order optimization methods. Instead of directly optimizing in prompt space, PromptBoosting obtains a small pool of prompts via a gradient-free approach and then constructs a large pool of weak learners by pairing these prompts with different elements of the LM's output distribution. These weak learners are then ensembled using the AdaBoost algorithm. The entire learning process requires only a small number of forward passes and no backward pass. Experiments show that PromptBoosting achieves state-of-the-art performance in multiple black-box few-shot classification tasks, and matches or outperforms full fine-tuning in both few-shot and standard learning paradigms, while training 10x faster than existing black-box methods.
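The ensembling step can be illustrated with a minimal binary AdaBoost over a fixed pool of weak learners whose predictions have already been computed, for example from forward passes through a black-box LM. This is a generic AdaBoost sketch, not the PromptBoosting implementation; the binary +1/-1 setup, the function names, and the toy predictions are assumptions.

```python
import math

def adaboost(weak_predictions, labels, rounds=3):
    """weak_predictions[j][i] is learner j's prediction (+1 or -1) on example i."""
    n = len(labels)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, learner_index)
    for _ in range(rounds):
        # Pick the weak learner with the lowest weighted error.
        errors = [
            sum(w for w, p, y in zip(weights, preds, labels) if p != y)
            for preds in weak_predictions
        ]
        j = min(range(len(errors)), key=errors.__getitem__)
        err = min(max(errors[j], 1e-10), 1 - 1e-10)  # clamp to avoid log(0)
        if err >= 0.5:
            break  # no learner beats chance on the reweighted data
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j))
        # Up-weight misclassified examples, then renormalize.
        weights = [w * math.exp(-alpha * y * p)
                   for w, p, y in zip(weights, weak_predictions[j], labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, weak_predictions, i):
    """Weighted vote of the selected weak learners on example i."""
    score = sum(alpha * weak_predictions[j][i] for alpha, j in ensemble)
    return 1 if score >= 0 else -1

# Toy data: three weak learners, each wrong on a different example.
labels = [1, 1, -1, -1, 1, -1]
weak_predictions = [
    [-1, 1, -1, -1, 1, -1],  # wrong on example 0
    [1, 1, 1, -1, 1, -1],    # wrong on example 2
    [1, 1, -1, -1, -1, -1],  # wrong on example 4
]

ensemble = adaboost(weak_predictions, labels, rounds=3)
print([predict(ensemble, weak_predictions, i) for i in range(len(labels))])
```

Since the weak learners' predictions are fixed inputs, boosting itself needs no gradients and no further calls to the LM, which is consistent with the query-efficiency claim in the abstract.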

machine learning, natural language, boosting, (18 more...)

arXiv.org Artificial Intelligence

2212.09257

Country:

North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > United States > California > Los Angeles County > Los Angeles (0.04)

Genre: Research Report (1.00)

Industry: Transportation > Air (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.65)

Low-Resource Cross-Lingual Adaptive Training for Nigerian Pidgin

Lin, Pin-Jie, Saeed, Muhammed, Chang, Ernie, Scholman, Merel

arXiv.org Artificial Intelligence, Jul-1-2023

Developing effective spoken language processing systems for low-resource languages poses several challenges due to the lack of parallel data and limited resources for fine-tuning models. In this work, we aim to improve both text classification and translation of Nigerian Pidgin (Naija) by collecting a large-scale parallel English-Pidgin corpus, and we further propose a framework of cross-lingual adaptive training that includes both continual and task adaptive training so as to adapt a base pre-trained model to low-resource languages. Our studies show that English pre-trained language models serve as a stronger prior than multilingual language models on English-Pidgin tasks, with up to 2.38 BLEU improvements, and demonstrate that augmenting orthographic data and using task adaptive training with back-translation can have a significant impact on model performance.

artificial intelligence, natural language, text classification, (17 more...)

arXiv.org Artificial Intelligence

2307.00382

Country:

Europe > Germany > Saarland (0.05)
Africa > Nigeria (0.04)
North America > Dominican Republic (0.04)
(3 more...)

Genre: Research Report > New Finding (0.47)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.51)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.35)

Weighted CapsuleNet Networks for Persian Multi-Domain Sentiment Analysis

Kobari, Mahboobeh Sadat, Karimi, Nima, Pourhosseini, Benyamin, Mousa, Ramin

arXiv.org Artificial Intelligence, Jul-1-2023

Sentiment classification is a fundamental task in natural language processing that assigns one of three classes (positive, negative, or neutral) to free text. However, sentiment classification models are highly domain dependent; a classifier may achieve reasonable accuracy in one domain but poor accuracy in another because words can carry different meanings across domains. This article presents a new Persian/Arabic multi-domain sentiment analysis method using a cumulative weighted capsule network approach. The weighted capsule ensemble consists of separate capsule networks trained for each domain and a weighting measure called domain belonging degree (DBD). This measure, built from TF and IDF, calculates the dependency of each document on each domain separately; this value is multiplied by the output that each capsule produces. The sum of these products is the final output and is used to determine the polarity, and the domain on which the document depends most is taken as its predicted domain. The proposed method was evaluated on the Digikala dataset and obtained acceptable accuracy compared to existing approaches. It achieved an accuracy of 0.89 on detecting the domain of belonging and 0.99 on detecting the polarity. In addition, a cost-sensitive function was used to deal with unbalanced classes; it improved sentiment classification accuracy by 0.0162. On Amazon Arabic data, the approach achieves an accuracy of 0.9695 in domain classification.
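The weighting scheme described above can be sketched as follows. The abstract gives no formula for DBD, so the TF-IDF scoring used here is a hypothetical stand-in, and the per-domain capsule networks are replaced by fixed, made-up output probabilities; all names and data are assumptions.

```python
import math
from collections import Counter

def domain_belonging_degree(tokens, domain_corpora):
    """Hypothetical DBD: per-domain sum of TF * IDF, normalized to sum to 1."""
    num_domains = len(domain_corpora)
    counts = {d: Counter(corpus) for d, corpus in domain_corpora.items()}
    scores = {}
    for domain, corpus in domain_corpora.items():
        score = 0.0
        for token in tokens:
            tf = counts[domain][token] / len(corpus)
            # Number of domains whose corpus contains the token.
            df = sum(1 for c in counts.values() if token in c)
            idf = math.log((1 + num_domains) / (1 + df)) + 1
            score += tf * idf
        scores[domain] = score
    total = sum(scores.values())
    if total == 0:
        return {d: 1 / num_domains for d in scores}
    return {d: s / total for d, s in scores.items()}

def weighted_polarity(dbd, domain_outputs):
    """Sum each domain model's class probabilities, weighted by its DBD."""
    combined = Counter()
    for domain, probs in domain_outputs.items():
        for label, p in probs.items():
            combined[label] += dbd[domain] * p
    return max(combined, key=combined.get), dict(combined)

# Toy domain corpora and stubbed per-domain model outputs (all made up).
domain_corpora = {
    "phones": ["battery", "screen", "camera", "battery"],
    "books": ["story", "author", "chapter", "story"],
}
document = ["battery", "story", "screen"]
domain_outputs = {
    "phones": {"positive": 0.9, "negative": 0.1},
    "books": {"positive": 0.3, "negative": 0.7},
}

dbd = domain_belonging_degree(document, domain_corpora)
predicted_domain = max(dbd, key=dbd.get)
polarity, combined = weighted_polarity(dbd, domain_outputs)
print(predicted_domain, polarity)
```

In this toy case the document leans toward the phones domain, so that domain's model dominates the final polarity decision.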

classification, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2306.17068

Country:

Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
North America > United States (0.04)
Asia > Middle East > Iran > Zanjan Province > Zanjan (0.04)
Asia > Middle East > Iran > Yazd Province > Yazd (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
(2 more...)
