AITopics | Text Classification

Collaborating Authors

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

News Overviews Instructional Materials AI-Alerts Classics

Two-in-One: A Model Hijacking Attack Against Text Generation Models

Si, Wai Man, Backes, Michael, Zhang, Yang, Salem, Ahmed

arXiv.org Artificial IntelligenceMay-12-2023

Machine learning has progressed significantly in various applications ranging from face recognition to text generation. However, its success has been accompanied by different attacks. Recently a new attack has been proposed which raises both accountability and parasitic computing risks, namely the model hijacking attack. Nevertheless, this attack has only focused on image classification tasks. In this work, we broaden the scope of this attack to include text generation and classification models, hence showing its broader applicability. More concretely, we propose a new model hijacking attack, Ditto, that can hijack different text classification tasks into multiple generation ones, e.g., language translation, text summarization, and language modeling. We use a range of text benchmark datasets such as SST-2, TweetEval, AGnews, QNLI, and IMDB to evaluate the performance of our attacks. Our results show that by using Ditto, an adversary can successfully hijack text generation models without jeopardizing their utility.

machine learning, natural language, text classification, (22 more...)

arXiv.org Artificial Intelligence

2305.07406

Country:

Europe > Germany > Berlin (0.04)
Asia > China (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Law Enforcement & Public Safety > Terrorism (0.94)
Leisure & Entertainment (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.88)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.66)

Add feedback

RepCL: Exploring Effective Representation for Continual Text Classification

Song, Yifan, Wang, Peiyi, Zhu, Dawei, Liu, Tianyu, Sui, Zhifang, Li, Sujian

arXiv.org Artificial IntelligenceMay-12-2023

Continual learning (CL) aims to constantly learn new knowledge over time while avoiding catastrophic forgetting on old tasks. In this work, we focus on continual text classification under the class-incremental setting. Recent CL studies find that the representations learned in one task may not be effective for other tasks, namely representation bias problem. For the first time we formally analyze representation bias from an information bottleneck perspective and suggest that exploiting representations with more class-relevant information could alleviate the bias. To this end, we propose a novel replay-based continual text classification method, RepCL. Our approach utilizes contrastive and generative representation learning objectives to capture more class-relevant features. In addition, RepCL introduces an adversarial replay strategy to alleviate the overfitting problem of replay. Experiments demonstrate that RepCL effectively alleviates forgetting and achieves state-of-the-art performance on three text classification tasks.

machine learning, natural language, text classification, (16 more...)

arXiv.org Artificial Intelligence

2305.07289

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Dominican Republic (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(13 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

A General-Purpose Multilingual Document Encoder

Galoğlu, Onur, Litschko, Robert, Glavaš, Goran

arXiv.org Artificial IntelligenceMay-11-2023

Massively multilingual pretrained transformers (MMTs) have tremendously pushed the state of the art on multilingual NLP and cross-lingual transfer of NLP models in particular. While a large body of work leveraged MMTs to mine parallel data and induce bilingual document embeddings, much less effort has been devoted to training general-purpose (massively) multilingual document encoder that can be used for both supervised and unsupervised document-level tasks. In this work, we pretrain a massively multilingual document encoder as a hierarchical transformer model (HMDE) in which a shallow document transformer contextualizes sentence representations produced by a state-of-the-art pretrained multilingual sentence encoder. We leverage Wikipedia as a readily available source of comparable documents for creating training data, and train HMDE by means of a cross-lingual contrastive objective, further exploiting the category hierarchy of Wikipedia for creation of difficult negatives. We evaluate the effectiveness of HMDE in two arguably most common and prominent cross-lingual document-level tasks: (1) cross-lingual transfer for topical document classification and (2) cross-lingual document retrieval. HMDE is significantly more effective than (i) aggregations of segment-based representations and (ii) multilingual Longformer. Crucially, owing to its massively multilingual lower transformer, HMDE successfully generalizes to languages unseen in document-level pretraining. We publicly release our code and models at https://github.com/ogaloglu/pre-training-multilingual-document-encoders .

machine learning, natural language, text classification, (20 more...)

arXiv.org Artificial Intelligence

2305.07016

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > New York > New York County > New York City (0.04)
(8 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.35)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.34)

Add feedback

Word Grounded Graph Convolutional Network

Lu, Zhibin, Xie, Qianqian, Wang, Benyou, Nie, Jian-yun

arXiv.org Artificial IntelligenceMay-10-2023

Graph Convolutional Networks (GCNs) have shown strong performance in learning text representations for various tasks such as text classification, due to its expressive power in modeling graph structure data (e.g., a literature citation network). Most existing GCNs are limited to deal with documents included in a pre-defined graph, i.e., it cannot be generalized to out-of-graph documents. To address this issue, we propose to transform the document graph into a word graph, to decouple data samples (i.e., documents in training and test sets) and a GCN model by using a document-independent graph. Such word-level GCN could therefore naturally inference out-of-graph documents in an inductive way. The proposed Word-level Graph (WGraph) can not only implicitly learning word presentation with commonly-used word co-occurrences in corpora, but also incorporate extra global semantic dependency derived from inter-document relationships (e.g., literature citations). An inductive Word-grounded Graph Convolutional Network (WGCN) is proposed to learn word and document representations based on WGraph in a supervised manner. Experiments on text classification with and without citation networks evidence that the proposed WGCN model outperforms existing methods in terms of effectiveness and efficiency.

machine learning, natural language, text classification, (19 more...)

arXiv.org Artificial Intelligence

2305.06434

Country:

North America > Canada > Quebec > Montreal (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)
(2 more...)

Genre: Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Curating corpora with classifiers: A case study of clean energy sentiment online

Arnold, Michael V., Dodds, Peter Sheridan, Danforth, Christopher M.

arXiv.org Artificial IntelligenceMay-9-2023

Well curated, large-scale corpora of social media posts containing broad public opinion offer an alternative data source to complement traditional surveys. While surveys are effective at collecting representative samples and are capable of achieving high accuracy, they can be both expensive to run and lag public opinion by days or weeks. Both of these drawbacks could be overcome with a real-time, high volume data stream and fast analysis pipeline. A central challenge in orchestrating such a data pipeline is devising an effective method for rapidly selecting the best corpus of relevant documents for analysis. Querying with keywords alone often includes irrelevant documents that are not easily disambiguated with bag-of-words natural language processing methods. Here, we explore methods of corpus curation to filter irrelevant tweets using pre-trained transformer-based models, fine-tuned for our binary classification task on hand-labeled tweets. We are able to achieve F1 scores of up to 0.95. The low cost and high performance of fine-tuning such a model suggests that our approach could be of broad benefit as a pre-processing step for social media datasets with uncertain corpus boundaries.

machine learning, natural language, text classification, (21 more...)

arXiv.org Artificial Intelligence

2305.03092

Country:

North America > United States > Vermont > Chittenden County > Burlington (0.14)
Europe > Norway (0.04)
South America (0.04)
(6 more...)

Genre: Research Report (0.64)

Industry:

Government (1.00)
Energy > Renewable > Solar (1.00)
Health & Medicine (0.94)
(2 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.88)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

An Exploration of Encoder-Decoder Approaches to Multi-Label Classification for Legal and Biomedical Text

Kementchedjhieva, Yova, Chalkidis, Ilias

arXiv.org Artificial IntelligenceMay-9-2023

Standard methods for multi-label text classification largely rely on encoder-only pre-trained language models, whereas encoder-decoder models have proven more effective in other classification tasks. In this study, we compare four methods for multi-label classification, two based on an encoder only, and two based on an encoder-decoder. We carry out experiments on four datasets -- two in the legal domain and two in the biomedical domain, each with two levels of label granularity -- and always depart from the same pre-trained model, T5. Our results show that encoder-decoder methods outperform encoder-only methods, with a growing advantage on more complex datasets and labeling schemes of finer granularity. Using encoder-decoder models in a non-autoregressive fashion, in particular, yields the best performance overall, so we further study this approach through ablations to better understand its strengths.

machine learning, natural language, text classification, (18 more...)

arXiv.org Artificial Intelligence

2305.05627

Country:

Europe > United Kingdom (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Government > Regional Government > Europe Government (0.68)
Law > International Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.34)

Add feedback

What Do Patients Say About Their Disease Symptoms? Deep Multilabel Text Classification With Human-in-the-Loop Curation for Automatic Labeling of Patient Self Reports of Problems

Arbatti, Lakshmi, Hosamath, Abhishek, Ramanarayanan, Vikram, Shoulson, Ira

arXiv.org Artificial IntelligenceMay-8-2023

The USA Food and Drug Administration has accorded increasing importance to patient-reported problems in clinical and research settings. In this paper, we explore one of the largest online datasets comprising 170,141 open-ended self-reported responses (called "verbatims") from patients with Parkinson's (PwPs) to questions about what bothers them about their Parkinson's Disease and how it affects their daily functioning, also known as the Parkinson's Disease Patient Report of Problems. Classifying such verbatims into multiple clinically relevant symptom categories is an important problem and requires multiple steps - expert curation, a multi-label text classification (MLTC) approach and large amounts of labelled training data. Further, human annotation of such large datasets is tedious and expensive. We present a novel solution to this problem where we build a baseline dataset using 2,341 (of the 170,141) verbatims annotated by nine curators including clinical experts and PwPs. We develop a rules based linguistic-dictionary using NLP techniques and graph database-based expert phrase-query system to scale the annotation to the remaining cohort generating the machine annotated dataset, and finally build a Keras-Tensorflow based MLTC model for both datasets. The machine annotated model significantly outperforms the baseline model with a F1-score of 95% across 65 symptom categories on a held-out test set.

machine learning, natural language, text classification, (15 more...)

arXiv.org Artificial Intelligence

2305.04905

Country: North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry:

Health & Medicine > Therapeutic Area > Neurology > Parkinson's Disease (1.00)
Health & Medicine > Therapeutic Area > Musculoskeletal (1.00)
Government > Regional Government > North America Government > United States Government > FDA (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.71)

Add feedback

Using Language Models on Low-end Hardware

Ziegner, Fabian, Borst, Janos, Niekler, Andreas, Potthast, Martin

arXiv.org Artificial IntelligenceMay-8-2023

The transition to neural networks as primary machine learning paradigm in natural language processing (NLP), and especially pre-training language models, became a major driver in NLP tasks within the Digital Humanities. Many applications in fields ranging, among other things, from Library Science, Literature Studies or Cultural Studies have been dramatically improved and automation of text based tasks is becoming widely possible. Current state-of-the-art approaches utilize pre-trained neural language models, which are fine-tuned to a given set of target variables (i.e., by training all parameters of the language model). Training neural networks requires calculating a gradient for every layer and batch element, thus easily tripling the required memory. Those complex and multi-step architectures often use specific hardware, for example Graphics processing units (GPU), in order to be efficiently trained.

machine learning, natural language, text classification, (16 more...)

arXiv.org Artificial Intelligence

2305.0235

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Germany > Saxony > Leipzig (0.05)
Europe > Italy > Tuscany > Florence (0.04)
(11 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Enhancing Pashto Text Classification using Language Processing Techniques for Single And Multi-Label Analysis

Dawodi, Mursal, Baktash, Jawid Ahmad

arXiv.org Artificial IntelligenceMay-4-2023

Text classification has become a crucial task in various fields, leading to a significant amount of research on developing automated text classification systems for national and international languages. However, there is a growing need for automated text classification systems that can handle local languages. This study aims to establish an automated classification system for Pashto text. We also evaluated two different feature extraction methods, bag of words and Term Frequency Inverse Document Frequency. The study achieved an average testing accuracy rate of 94% using the MLP classification algorithm and TFIDF feature extraction method in single-label multiclass classification. Similarly, MLP+TFIDF yielded the best results, with an F1-measure of 0.81. Furthermore, the use of pre-trained language representation models, such as DistilBERT, showed promising results for Pashto text classification; however, the study highlights the importance of developing a specific tokenizer for a particular language to achieve reasonable results. NTRODUCTION The evolution of technology instigated the existence of an overwhelming number of electronic documents therefore text mining becomes a crucial task. Many businesses and individuals use machine learning techniques to classify documents accurately and quickly. On the other hand, more than 80% of organization information is in electronic format including news, email, data about users, reports, etc. (Raghavan, 2004). Text mining attracted the attention of researchers to automatically figure out the patterns of millions of electronic texts.

classification, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2305.03201

Country:

Asia > Middle East > Republic of Türkiye (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
(10 more...)

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.31)

Add feedback

ReMask: A Robust Information-Masking Approach for Domain Counterfactual Generation

Hong, Pengfei, Bhardwaj, Rishabh, Majumdar, Navonil, Aditya, Somak, Poria, Soujanya

arXiv.org Artificial IntelligenceMay-4-2023

Domain shift is a big challenge in NLP, thus, many approaches resort to learning domain-invariant features to mitigate the inference phase domain shift. Such methods, however, fail to leverage the domain-specific nuances relevant to the task at hand. To avoid such drawbacks, domain counterfactual generation aims to transform a text from the source domain to a given target domain. However, due to the limited availability of data, such frequency-based methods often miss and lead to some valid and spurious domain-token associations. Hence, we employ a three-step domain obfuscation approach that involves frequency and attention norm-based masking, to mask domain-specific cues, and unmasking to regain the domain generic context. Our experiments empirically show that the counterfactual samples sourced from our masked text lead to improved domain transfer on 10 out of 12 domain sentiment classification settings, with an average of 2% accuracy improvement over the state-of-the-art for unsupervised domain adaptation (UDA). Further, our model outperforms the state-of-the-art by achieving 1.4% average accuracy improvement in the adversarial domain adaptation (ADA) setting. Moreover, our model also shows its domain adaptation efficacy on a large multi-domain intent classification dataset where it attains state-of-the-art results. We release the codes publicly at \url{https://github.com/declare-lab/remask}.

machine learning, natural language, text classification, (20 more...)

arXiv.org Artificial Intelligence

2305.02858

Country:

Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Asia > Singapore (0.04)
Asia > India > West Bengal > Kharagpur (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.34)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.34)

Add feedback