AITopics | Text Classification

Collaborating Authors

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

News Overviews Instructional Materials AI-Alerts Classics

Survey on Publicly Available Sinhala Natural Language Processing Tools and Research

de Silva, Nisansa

arXiv.org Artificial IntelligenceJan-4-2024

Sinhala is the native language of the Sinhalese people who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning language tree, Indo-European. However, due to poverty in both linguistic and economic capital, Sinhala, in the perspective of Natural Language Processing tools and research, remains a resource-poor language which has neither the economic drive its cousin English has nor the sheer push of the law of numbers a language such as Chinese has. A number of research groups from Sri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing. However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap of a comprehensive literature survey of the publicly available Sinhala natural language tools and research so that the researchers working in this field can better utilize contributions of their peers. As such, we shall be uploading this paper to arXiv and perpetually update it periodically to reflect the advances made in the field.

ieee, international conference, sinhala, (14 more...)

arXiv.org Artificial Intelligence

1906.02358

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Finland > Uusimaa > Helsinki (0.04)
North America > United States > New York (0.04)
(13 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.67)

Industry:

Media > News (1.00)
Information Technology > Services (1.00)
Education (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
(12 more...)

Add feedback

Iterative Mask Filling: An Effective Text Augmentation Method Using Masked Language Modeling

Kesgin, Himmet Toprak, Amasyali, Mehmet Fatih

arXiv.org Artificial IntelligenceJan-3-2024

Data augmentation is an effective technique for improving the performance of machine learning models. However, it has not been explored as extensively in natural language processing (NLP) as it has in computer vision. In this paper, we propose a novel text augmentation method that leverages the Fill-Mask feature of the transformer-based BERT model. Our method involves iteratively masking words in a sentence and replacing them with language model predictions. We have tested our proposed method on various NLP tasks and found it to be effective in many cases. Our results are presented along with a comparison to existing augmentation methods. Experimental results show that our proposed method significantly improves performance, especially on topic classification datasets.

augmentation, augmentation method, dataset, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-3-031-50920-9_35

2401.0183

Country:

North America > United States > New York (0.05)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Punctuation Matters! Stealthy Backdoor Attack for Language Models

Sheng, Xuan, Li, Zhicheng, Han, Zhaoyang, Chang, Xiangmao, Li, Piji

arXiv.org Artificial IntelligenceDec-25-2023

Recent studies have pointed out that natural language processing (NLP) models are vulnerable to backdoor attacks. A backdoored model produces normal outputs on the clean samples while performing improperly on the texts with triggers that the adversary injects. However, previous studies on textual backdoor attack pay little attention to stealthiness. Moreover, some attack methods even cause grammatical issues or change the semantic meaning of the original texts. Therefore, they can easily be detected by humans or defense systems. In this paper, we propose a novel stealthy backdoor attack method against textual models, which is called \textbf{PuncAttack}. It leverages combinations of punctuation marks as the trigger and chooses proper locations strategically to replace them. Through extensive experiments, we demonstrate that the proposed method can effectively compromise multiple models in various tasks. Meanwhile, we conduct automatic evaluation and human inspection, which indicate the proposed method possesses good performance of stealthiness without bringing grammatical issues and altering the meaning of sentences.

backdoor attack, dataset, punctuation mark, (15 more...)

arXiv.org Artificial Intelligence

2312.15867

Country: Asia > China > Jiangsu Province > Nanjing (0.05)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

TACIT: A Target-Agnostic Feature Disentanglement Framework for Cross-Domain Text Classification

Song, Rui, Giunchiglia, Fausto, Li, Yingji, Tian, Mingjie, Xu, Hao

arXiv.org Artificial IntelligenceDec-24-2023

Cross-domain text classification aims to transfer models from label-rich source domains to label-poor target domains, giving it a wide range of practical applications. Many approaches promote cross-domain generalization by capturing domain-invariant features. However, these methods rely on unlabeled samples provided by the target domains, which renders the model ineffective when the target domain is agnostic. Furthermore, the models are easily disturbed by shortcut learning in the source domain, which also hinders the improvement of domain generalization ability. To solve the aforementioned issues, this paper proposes TACIT, a target domain agnostic feature disentanglement framework which adaptively decouples robust and unrobust features by Variational Auto-Encoders. Additionally, to encourage the separation of unrobust features from robust features, we design a feature distillation task that compels unrobust features to approximate the output of the teacher. The teacher model is trained with a few easy samples that are easy to carry potential unknown shortcuts. Experimental results verify that our framework achieves comparable results to state-of-the-art baselines while utilizing only source domain data.

computational linguistic, target domain, unrobust feature, (14 more...)

arXiv.org Artificial Intelligence

2312.17263

Country:

Asia > China > Jilin Province (0.04)
Asia > China > Chongqing Province > Chongqing (0.04)
Europe > Italy > Trentino-Alto Adige/Südtirol > Trentino Province > Trento (0.04)
Asia > Japan (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.86)

Add feedback

RethinkingTMSC: An Empirical Study for Target-Oriented Multimodal Sentiment Classification

Ye, Junjie, Zhou, Jie, Tian, Junfeng, Wang, Rui, Zhang, Qi, Gui, Tao, Huang, Xuanjing

arXiv.org Artificial IntelligenceDec-23-2023

Recently, Target-oriented Multimodal Sentiment Classification (TMSC) has gained significant attention among scholars. However, current multimodal models have reached a performance bottleneck. To investigate the causes of this problem, we perform extensive empirical evaluation and in-depth analysis of the datasets to answer the following questions: Q1: Are the modalities equally important for TMSC? Q2: Which multimodal fusion modules are more effective? Q3: Do existing datasets adequately support the research? Our experiments and analyses reveal that the current TMSC systems primarily rely on the textual modality, as most of targets' sentiments can be determined solely by text. Consequently, we point out several directions to work on for the TMSC task in terms of model design and dataset construction. The code and data can be found in https://github.com/Junjie-Ye/RethinkingTMSC.

dataset, multimodal sentiment, sentiment, (17 more...)

arXiv.org Artificial Intelligence

2310.09596

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > Lebanon (0.05)
Asia > China > Shanghai > Shanghai (0.05)
(14 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.73)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.73)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.61)

Add feedback

Exploiting Contextual Target Attributes for Target Sentiment Classification

Xing, Bowen, Tsang, Ivor W.

arXiv.org Artificial IntelligenceDec-21-2023

Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt PTLM as the context encoder; 2) prompting-based models that transfer the classification task to the text/word generation task. In this paper, we present a new perspective of leveraging PTLM for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes. Specifically, we design the domain- and target-constrained cloze test, which can leverage the PTLMs' strong language modeling ability to generate the given target's attributes pertaining to the review context. The attributes contain the background and property information of the target, which can help to enrich the semantics of the review context and the target. To exploit the attributes for tackling TSC, we first construct a heterogeneous information graph by treating the attributes as nodes and combining them with (1) the syntax graph automatically produced by the off-the-shelf dependency parser and (2) the semantics graph of the review context, which is derived from the self-attention mechanism. Then we propose a heterogeneous information gated graph convolutional network to model the interactions among the attribute information, the syntactic information, and the contextual information. The experimental results on three benchmark datasets demonstrate the superiority of our model, which achieves new state-of-the-art performance.

contextual target, information, interaction, (14 more...)

arXiv.org Artificial Intelligence

2312.13766

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Singapore (0.04)
Oceania > Australia (0.04)
(7 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.54)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.54)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.49)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.42)

Add feedback

A Learning oriented DLP System based on Classification Model

Gupta, Kishu, Kush, Ashwani

arXiv.org Artificial IntelligenceDec-21-2023

Data is the key asset for organizations and data sharing is lifeline for organization growth; which may lead to data loss. Data leakage is the most critical issue being faced by organizations. In order to mitigate the data leakage issues data leakage prevention systems (DLPSs) are deployed at various levels by the organizations. DLPSs are capable to protect all kind of data i.e. DAR, DIM/DIT, DIU. Statistical analysis, regular expression, data fingerprinting are common approaches exercised in DLP system. Out of these techniques; statistical analysis approach is most appropriate for proposed DLP model of data security. This paper defines a statistical DLP model for document classification. Model uses various statistical approaches like TF-IDF (Term Frequency- Inverse Document Frequency) a renowned term count/weighing function, Vectorization, Gradient boosting document classification etc. to classify the documents before allowing any access to it. Machine learning is used to test and train the model. Proposed model also introduces an extremely efficient and more accurate approach; IGBCA (Improvised Gradient Boosting Classification Algorithm); for document classification, to prevent them from possible data leakage. Results depicts that proposed model can classify documents with high accuracy and on basis of which data can be prevented from being loss.

accuracy, dlp system, learning, (15 more...)

arXiv.org Artificial Intelligence

2312.13711

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
North America > United States > Arizona > Maricopa County > Phoenix (0.04)
Asia > Singapore (0.04)
(3 more...)

Genre: Research Report (0.40)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.94)

Add feedback

Compositional Generalization for Multi-label Text Classification: A Data-Augmentation Approach

Chai, Yuyang, Li, Zhuang, Liu, Jiahui, Chen, Lei, Li, Fei, Ji, Donghong, Teng, Chong

arXiv.org Artificial IntelligenceDec-20-2023

Despite significant advancements in multi-label text classification, the ability of existing models to generalize to novel and seldom-encountered complex concepts, which are compositions of elementary ones, remains underexplored. This research addresses this gap. By creating unique data splits across three benchmarks, we assess the compositional generalization ability of existing multi-label text classification models. Our results show that these models often fail to generalize to compositional concepts encountered infrequently during training, leading to inferior performance on tests with these new combinations. To address this, we introduce a data augmentation method that leverages two innovative text generation models designed to enhance the classification models' capacity for compositional generalization. Our experiments show that this data augmentation approach significantly improves the compositional generalization capabilities of classification models on our benchmarks, with both generation models surpassing other text generation baselines.

composition, label composition, representation, (14 more...)

arXiv.org Artificial Intelligence

2312.11276

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
North America > United States (0.04)
North America > Dominican Republic (0.04)
(4 more...)

Genre: Research Report > New Finding (0.86)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Regularized Conditional Alignment for Multi-Domain Text Classification

Hu, Juntao, Wu, Yuan

arXiv.org Artificial IntelligenceDec-18-2023

The most successful multi-domain text classification (MDTC) approaches employ the shared-private paradigm to facilitate the enhancement of domain-invariant features through domain-specific attributes. Additionally, they employ adversarial training to align marginal feature distributions. Nevertheless, these methodologies encounter two primary challenges: (1) Neglecting class-aware information during adversarial alignment poses a risk of misalignment; (2) The limited availability of labeled data across multiple domains fails to ensure adequate discriminative capacity for the model. To tackle these issues, we propose a method called Regularized Conditional Alignment (RCA) to align the joint distributions of domains and classes, thus matching features within the same category and amplifying the discriminative qualities of acquired features. Moreover, we employ entropy minimization and virtual adversarial training to constrain the uncertainty of predictions pertaining to unlabeled data and enhance the model's robustness. Empirical results on two benchmark datasets demonstrate that our RCA approach outperforms state-of-the-art MDTC techniques.

classification, discriminator, text classification, (13 more...)

arXiv.org Artificial Intelligence

2312.11572

Country: Asia > China > Jilin Province > Changchun (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Information Type Classification with Contrastive Task-Specialized Sentence Encoders

Seeberger, Philipp, Bocklet, Tobias, Riedhammer, Korbinian

arXiv.org Artificial IntelligenceDec-18-2023

User-generated information content has become an important information source in crisis situations. However, classification models suffer from noise and event-related biases which still poses a challenging task and requires sophisticated task-adaptation. To address these challenges, we propose the use of contrastive task-specialized sentence encoders for downstream classification. We apply the task-specialization on the CrisisLex, HumAID, and TrecIS information type classification tasks and show performance gains w.r.t. F1-score. Furthermore, we analyse the cross-corpus and cross-lingual capabilities for two German event relevancy classification datasets.

classification, information type classification, proceedings, (11 more...)

arXiv.org Artificial Intelligence

2312.1102

Country:

Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.14)
North America > United States > New York > New York County > New York City (0.05)
North America > Dominican Republic (0.04)
(6 more...)

Genre: Research Report (0.82)

Industry:

Health & Medicine (0.69)
Information Technology > Security & Privacy (0.68)
Government (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.72)

Add feedback