AITopics | Text Classification

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.
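
As a concrete illustration of the kind of classifier the quote describes, here is a minimal multinomial naive Bayes sketch in Python. The `NaiveBayes` class, the toy messages, and the use of add-one smoothing are assumptions made for this example; none of it is taken from the cited article.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus summed log likelihoods, with add-one smoothing
            # so that unseen words do not zero out a class.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data (spam filtering, as in the quote).
nb = NaiveBayes()
nb.train("win money now claim your free prize", "spam")
nb.train("free money offer click now", "spam")
nb.train("meeting agenda for monday project review", "ham")
nb.train("please review the project report before the meeting", "ham")

print(nb.classify("claim your free money"))
print(nb.classify("project meeting on monday"))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.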

LIME: Weakly-Supervised Text Classification Without Seeds

Park, Seongmin, Lee, Jihwa

arXiv.org Artificial Intelligence, Oct-13-2022

In weakly-supervised text classification, only label names act as sources of supervision. Predominant approaches to weakly-supervised text classification utilize a two-phase framework, where test samples are first assigned pseudo-labels and are then used to train a neural text classifier. In most previous work, the pseudo-labeling step is dependent on obtaining seed words that best capture the relevance of each class label. We present LIME, a framework for weakly-supervised text classification that entirely replaces the brittle seed-word generation process with entailment-based pseudo-classification. We find that combining weakly-supervised classification and textual entailment mitigates shortcomings of both, resulting in a more streamlined and effective classification pipeline. With just an off-the-shelf textual entailment model, LIME outperforms recent baselines in weakly-supervised text classification and achieves state-of-the-art in 4 benchmarks. We open source our code at https://github.com/seongminp/LIME.

classification, natural language, text classification, (17 more...)

arXiv.org Artificial Intelligence

2210.0672

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
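
The pseudo-labeling phase described in the LIME abstract above (scoring each document against a hypothesis built from every label name, then keeping the best-scoring label) can be sketched roughly as follows. This is not the authors' implementation: the hypothesis template, the `pseudo_label` function, and the word-overlap scorer are all hypothetical. The scorer is only a stand-in that keeps the sketch runnable; a real pipeline would pass an off-the-shelf textual entailment model as `score_fn`.

```python
from typing import Callable, List, Tuple


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model: fraction of hypothesis words found in the premise.

    This is NOT a real entailment model; it only keeps the sketch runnable.
    """
    premise_words = set(premise.lower().split())
    hypothesis_words = hypothesis.lower().split()
    hits = sum(1 for w in hypothesis_words if w in premise_words)
    return hits / len(hypothesis_words)


def pseudo_label(
    documents: List[str],
    label_names: List[str],
    score_fn: Callable[[str, str], float] = toy_entailment_score,
    template: str = "this text is about {}",  # hypothetical hypothesis template
) -> List[Tuple[str, str, float]]:
    """Assign each document the label whose hypothesis scores highest."""
    labeled = []
    for doc in documents:
        scores = {name: score_fn(doc, template.format(name)) for name in label_names}
        best = max(scores, key=scores.get)
        labeled.append((doc, best, scores[best]))
    return labeled


# Hypothetical unlabeled documents; only the label names act as supervision.
documents = [
    "the team won the football match and sports fans cheered",
    "the election results changed politics in the country",
]
label_names = ["sports", "politics"]

pseudo_labels = pseudo_label(documents, label_names)
for doc, label, score in pseudo_labels:
    print(f"{label:8s} {score:.2f}  {doc}")
```

In the two-phase framework the abstract describes, the resulting (document, pseudo-label) pairs would then be used to train a neural text classifier.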

An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification

Chalkidis, Ilias, Dai, Xiang, Fergadiotis, Manos, Malakasiotis, Prodromos, Elliott, Desmond

arXiv.org Artificial Intelligence, Oct-11-2022

Non-hierarchical sparse attention Transformer-based models, such as Longformer and Big Bird, are popular approaches to working with long documents. There are clear benefits to these approaches compared to the original Transformer in terms of efficiency, but Hierarchical Attention Transformer (HAT) models are a vastly understudied alternative. We develop and release fully pre-trained HAT models that use segment-wise followed by cross-segment encoders and compare them with Longformer models and partially pre-trained HATs. In several long document downstream classification tasks, our best HAT model outperforms equally-sized Longformer models while using 10-20% less GPU memory and processing documents 40-45% faster. In a series of ablation studies, we find that HATs perform better with cross-segment contextualization throughout the model than with alternative configurations that implement either early or late cross-segment contextualization. Our code is on GitHub: https://github.com/coastalcph/hierarchical-transformers.

large language model, longformer, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2210.05529

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > Dominican Republic (0.04)
(10 more...)

Genre: Research Report > New Finding (0.46)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.65)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)

Public Wisdom Matters! Discourse-Aware Hyperbolic Fourier Co-Attention for Social-Text Classification

Grover, Karish, Angara, S. M. Phaneendra, Akhtar, Md. Shad, Chakraborty, Tanmoy

arXiv.org Artificial Intelligence, Oct-11-2022

Social media has become the fulcrum of all forms of communication. Classifying social texts such as fake news, rumour, sarcasm, etc. has gained significant attention. The surface-level signals expressed by a social-text itself may not be adequate for such tasks; therefore, recent methods attempted to incorporate other intrinsic signals such as user behavior and the underlying graph structure. Oftentimes, the 'public wisdom' expressed through the comments/replies to a social-text acts as a surrogate for the crowd-sourced view and may provide us with complementary signals. State-of-the-art methods on social-text classification tend to ignore such a rich hierarchical signal. Here, we propose Hyphen, a discourse-aware hyperbolic spectral co-attention network. Hyphen is a fusion of hyperbolic graph representation learning with a novel Fourier co-attention mechanism in an attempt to generalise the social-text classification tasks by incorporating public discourse. We parse public discourse as an Abstract Meaning Representation (AMR) graph and use the powerful hyperbolic geometric representation to model graphs with hierarchical structure. Finally, we equip it with a novel Fourier co-attention mechanism to capture the correlation between the source post and public discourse. Extensive experiments on four different social-text classification tasks, namely detecting fake news, hate speech, rumour, and sarcasm, show that Hyphen generalises well, and achieves state-of-the-art results on ten benchmark datasets. We also employ a sentence-level fact-checked and annotated dataset to evaluate how Hyphen is capable of producing explanations as analogous evidence to the final prediction.

machine learning, natural language, text classification, (18 more...)

arXiv.org Artificial Intelligence

2209.13017

Country:

Asia > India > NCT > Delhi (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(2 more...)

Genre: Research Report > Experimental Study (0.68)

Industry:

Media > News (0.71)
Health & Medicine > Therapeutic Area (0.47)
Information Technology (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A Multilingual Bag-of-Entities Model for Zero-Shot Cross-Lingual Text Classification

Nishikawa, Sosuke, Yamada, Ikuya, Tsuruoka, Yoshimasa, Echizen, Isao

arXiv.org Artificial Intelligence, Oct-11-2022

[...] learning, models are trained on annotated data in a resource-rich language (the source language) and then applied to another language (the target language) without any training. Substantial progress in cross-lingual transfer learning has been made using multilingual pre-trained language models (PLMs), such as multilingual BERT (M-BERT), jointly trained on massive corpora in multiple languages (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020a). However, recent empirical studies have found that cross-lingual transfer [...]

Inspired by previous work (Yamada and Shindo, 2019; Peters et al., 2019), we compute the weights using an attention mechanism that selects the entities relevant to the given document. We then compute the sum of the entity-based document representation and the text-based document representation computed using the PLM and feed it into a linear classifier. Since the entity vocabulary and entity embedding are shared across languages, a model trained on entity features in the source language can be directly transferred to multiple target languages.

classification, information retrieval, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2110.07792

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
Europe > Germany (0.04)
Asia > Taiwan (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.68)

HPT: Hierarchy-aware Prompt Tuning for Hierarchical Text Classification

Wang, Zihan, Wang, Peiyi, Liu, Tianyu, Lin, Binghuai, Cao, Yunbo, Sui, Zhifang, Wang, Houfeng

arXiv.org Artificial Intelligence, Oct-10-2022

Hierarchical text classification (HTC) is a challenging subtask of multi-label classification due to its complex label hierarchy. Recently, pretrained language models (PLMs) have been widely adopted in HTC through a fine-tuning paradigm. However, in this paradigm, there exists a huge gap between classification tasks with a sophisticated label hierarchy and the masked language model (MLM) pretraining tasks of PLMs, and thus the potential of PLMs cannot be fully tapped. To bridge the gap, in this paper, we propose HPT, a Hierarchy-aware Prompt Tuning method to handle HTC from a multi-label MLM perspective. Specifically, we construct a dynamic virtual template and label words that take the form of soft prompts to fuse the label hierarchy knowledge and introduce a zero-bounded multi-label cross entropy loss to harmonize the objectives of HTC and MLM. Extensive experiments show HPT achieves state-of-the-art performance on 3 popular HTC datasets and is adept at handling imbalanced and low-resource situations. Our code is available at https://github.com/wzh9969/HPT.

classification, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2204.13413

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Hong Kong (0.04)
North America > United States > Colorado > Denver County > Denver (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Pytorch & C++ #6: Bert Text Classification in C++

#artificialintelligence, Oct-9-2022, 08:00:23 GMT

In this series, I am trying to provide examples, practices, and projects that use the PyTorch C++ API. If you are new to the series, you can check the first blog post before diving in. In this story, we will train a BERT model to classify tweets as offensive or not. All the code is available in this GitHub repo. We will use the Hate Speech Detection Dataset.

bert model, bert text classification, vocab, (2 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.64)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.40)

KG-MTT-BERT: Knowledge Graph Enhanced BERT for Multi-Type Medical Text Classification

He, Yong, Wang, Cheng, Zhang, Shun, Li, Nan, Li, Zhaorong, Zeng, Zhenyu

arXiv.org Artificial Intelligence, Oct-8-2022

Medical text learning has recently emerged as a promising area to improve healthcare due to the wide adoption of electronic health record (EHR) systems. The complexity of medical text, such as its diverse length, mixed text types, and abundance of medical jargon, poses a great challenge for developing effective deep learning models. BERT has presented state-of-the-art results in many NLP tasks, such as text classification and question answering. However, the standalone BERT model cannot deal with the complexity of medical text, especially lengthy clinical notes. Herein, we develop a new model called KG-MTT-BERT (Knowledge Graph Enhanced Multi-Type Text BERT) by extending the BERT model for long and multi-type text with the integration of a medical knowledge graph. Our model can outperform all baselines and other state-of-the-art models in diagnosis-related group (DRG) classification, which requires comprehensive medical text for accurate classification. We also demonstrate that our model can effectively handle multi-type text and that the integration of the medical knowledge graph can significantly improve performance.

classification, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2210.0397

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (0.51)
Instructional Material > Online (0.41)

Industry: Health & Medicine > Health Care Technology > Medical Record (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Adaptive Ranking-based Sample Selection for Weakly Supervised Class-imbalanced Text Classification

Song, Linxin, Zhang, Jieyu, Yang, Tianxiang, Goto, Masayuki

arXiv.org Artificial Intelligence, Oct-7-2022

To obtain a large number of training labels inexpensively, researchers have recently adopted the weak supervision (WS) paradigm, which leverages labeling rules to synthesize training labels rather than using individual annotations to achieve competitive results for natural language processing (NLP) tasks. However, data imbalance is often overlooked in applying the WS paradigm, despite being a common issue in a variety of NLP tasks. To address this challenge, we propose Adaptive Ranking-based Sample Selection (ARS2), a model-agnostic framework to alleviate the data imbalance issue in the WS paradigm. Specifically, it calculates a probabilistic margin score based on the output of the current model to measure and rank the cleanliness of each data point. Then, the ranked data are sampled based on both class-wise and rule-aware ranking. In particular, the two sampling strategies correspond to our motivations: (1) to train the model with balanced data batches to reduce the data imbalance issue and (2) to exploit the expertise of each labeling rule for collecting clean samples. Experiments on four text classification datasets with four different imbalance ratios show that ARS2 outperformed the state-of-the-art imbalanced learning and WS methods, leading to a 2%-57.8% improvement on their F1-score.

machine learning, natural language, text classification, (17 more...)

arXiv.org Artificial Intelligence

2210.03092

Country:

Asia > Japan > Shikoku > Kagawa Prefecture > Takamatsu (0.04)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
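
The class-wise half of the selection step in the ARS2 abstract above (score each weakly labeled sample by a margin computed from the current model's output, rank by that score, then sample evenly per class) can be sketched as follows. The margin used here, the probability of the weak label minus the highest competing probability, is an assumed definition; the paper's exact score and its rule-aware ranking are not reproduced. The function names and the model outputs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def margin_score(probs: Sequence[float], weak_label: int) -> float:
    """Probability of the weak label minus the best competing probability."""
    others = [p for i, p in enumerate(probs) if i != weak_label]
    return probs[weak_label] - max(others)


def class_balanced_batch(
    model_probs: List[Sequence[float]],
    weak_labels: List[int],
    per_class: int,
) -> List[int]:
    """Return indices of the `per_class` highest-margin samples for each class."""
    by_class: Dict[int, List[Tuple[float, int]]] = defaultdict(list)
    for idx, (probs, label) in enumerate(zip(model_probs, weak_labels)):
        by_class[label].append((margin_score(probs, label), idx))
    batch = []
    for label in sorted(by_class):
        # Highest margin first: these are treated as the cleanest samples.
        ranked = sorted(by_class[label], reverse=True)
        batch.extend(idx for _, idx in ranked[:per_class])
    return batch


# Hypothetical model outputs for 6 samples and 2 classes (class 0 is the majority).
model_probs = [
    [0.9, 0.1],    # weak label 0, confident
    [0.6, 0.4],    # weak label 0, less confident
    [0.3, 0.7],    # weak label 0, model disagrees (likely noisy)
    [0.8, 0.2],    # weak label 0, confident
    [0.2, 0.8],    # weak label 1, confident
    [0.55, 0.45],  # weak label 1, model disagrees
]
weak_labels = [0, 0, 0, 0, 1, 1]

batch = class_balanced_batch(model_probs, weak_labels, per_class=2)
print(batch)
```

Taking the same number of samples from every class yields a balanced batch even though class 0 has twice as many weakly labeled samples as class 1.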

Augmentor or Filter? Reconsider the Role of Pre-trained Language Model in Text Classification Augmentation

Yang, Heng, Li, Ke

arXiv.org Artificial Intelligence, Oct-6-2022

Text augmentation is one of the most effective techniques to solve the critical problem of insufficient data in text classification. Existing text augmentation methods achieve promising performance in few-shot text data augmentation. However, these methods usually lead to performance degeneration on public datasets due to poor-quality augmentation instances. Our study shows that, even when employing pre-trained language models, existing text augmentation methods generate numerous low-quality instances and lead to the feature space shift problem in augmentation instances. However, we note that the pre-trained language model is good at finding low-quality instances provided that it has been fine-tuned on the target dataset. To alleviate the feature space shift and performance degeneration in existing text augmentation methods, we propose BOOSTAUG, which reconsiders the role of the language model in text augmentation and emphasizes augmentation instance filtering rather than generation. We evaluate BOOSTAUG on both sentence-level text classification and aspect-based sentiment classification. The experimental results on seven commonly used text classification datasets show that our augmentation method obtains state-of-the-art performance. Moreover, BOOSTAUG is a flexible framework; we release the code, which can help improve existing augmentation methods.

artificial intelligence, natural language, text classification, (14 more...)

arXiv.org Artificial Intelligence

2210.02941

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Diego County > San Diego (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
(12 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)

Privacy-Preserving Text Classification on BERT Embeddings with Homomorphic Encryption

Lee, Garam, Kim, Minsoo, Park, Jai Hyun, Hwang, Seung-won, Cheon, Jung Hee

arXiv.org Artificial Intelligence, Oct-5-2022

Embeddings, which compress information in raw text into semantics-preserving low-dimensional vectors, have been widely adopted for their efficacy. However, recent research has shown that embeddings can potentially leak private information about sensitive attributes of the text, and in some cases, can be inverted to recover the original input text. To address these growing privacy challenges, we propose a privatization mechanism for embeddings based on homomorphic encryption, to prevent potential leakage of any piece of information in the process of text classification. In particular, our method performs text classification on the encryption of embeddings from state-of-the-art models like BERT, supported by an efficient GPU implementation of the CKKS encryption scheme. We show that our method offers encrypted protection of BERT embeddings, while largely preserving their utility on downstream text classification tasks.

computational linguistic, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2210.02574

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(4 more...)

Genre: Research Report > New Finding (0.49)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.51)
