AITopics | Text Classification

Collaborating Authors

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

News Overviews Instructional Materials AI-Alerts Classics

Token Classification for Disambiguating Medical Abbreviations

Cevik, Mucahit, Jafari, Sanaz Mohammad, Myers, Mitchell, Yildirim, Savas

arXiv.org Artificial IntelligenceOct-5-2022

Abbreviations are unavoidable yet critical parts of the medical text. Using abbreviations, especially in clinical patient notes, can save time and space, protect sensitive information, and help avoid repetitions. However, most abbreviations might have multiple senses, and the lack of a standardized mapping system makes disambiguating abbreviations a difficult and time-consuming task. The main objective of this study is to examine the feasibility of token classification methods for medical abbreviation disambiguation. Specifically, we explore the capability of token classification methods to deal with multiple unique abbreviations in a single text. We use two public datasets to compare and contrast the performance of several transformer models pre-trained on different scientific and medical corpora. Our proposed token classification approach outperforms the more commonly used text classification models for the abbreviation disambiguation task. In particular, the SciBERT model shows a strong performance for both token and text classification tasks over the two considered datasets. Furthermore, we find that abbreviation disambiguation performance for the text classification models becomes comparable to that of token classification only when postprocessing is applied to their predictions, which involves filtering possible labels for an abbreviation based on the training data.

machine learning, natural language, text classification, (20 more...)

arXiv.org Artificial Intelligence

2210.02487

Country:

North America > United States > Minnesota (0.04)
North America > Canada > Ontario > Toronto (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (0.93)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)
Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.77)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

Add feedback

DocSCAN: Unsupervised Text Classification via Learning from Neighbors

Stammbach, Dominik, Ash, Elliott

arXiv.org Artificial IntelligenceOct-4-2022

We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN). For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels. Our learnable clustering approach uses pairs of neighboring datapoints as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without provided ground-truth labels. On five topic classification benchmarks, we improve on various unsupervised baselines by a large margin. In datasets with relatively few and balanced outcome classes, DocSCAN approaches the performance of supervised classification. The method fails for other types of classification, such as sentiment analysis, pointing to important conceptual and practical differences between classifying images and texts.

machine learning, natural language, text classification, (17 more...)

arXiv.org Artificial Intelligence

2105.04024

Country:

North America > United States > New York > New York County > New York City (0.14)
Europe > Switzerland > Zürich > Zürich (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(4 more...)

Genre:

Research Report (0.67)
Overview (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Few-shot Text Classification with Dual Contrastive Consistency

Sun, Liwen, Han, Jiawei

arXiv.org Artificial IntelligenceSep-29-2022

In this paper, we explore how to utilize pre-trained language model to perform few-shot text classification where only a few annotated examples are given for each class. Since using traditional cross-entropy loss to fine-tune language model under this scenario causes serious overfitting and leads to sub-optimal generalization of model, we adopt supervised contrastive learning on few labeled data and consistency-regularization on vast unlabeled data. Moreover, we propose a novel contrastive consistency to further boost model performance and refine sentence representation. After conducting extensive experiments on four datasets, we demonstrate that our model (FTCC) can outperform state-of-the-art methods and has better robustness.

computational linguistic, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2209.15069

Country:

North America > United States > Illinois > Champaign County > Champaign (0.04)
North America > United States > Texas > Travis County > Austin (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.48)

Add feedback

Towards Simple and Efficient Task-Adaptive Pre-training for Text Classification

Ladkat, Arnav, Miyajiwala, Aamir, Jagadale, Samiksha, Kulkarni, Rekha, Joshi, Raviraj

arXiv.org Artificial IntelligenceSep-26-2022

Language models are pre-trained using large corpora of generic data like book corpus, common crawl and Wikipedia, which is essential for the model to understand the linguistic characteristics of the language. New studies suggest using Domain Adaptive Pre-training (DAPT) and Task-Adaptive Pre-training (TAPT) as an intermediate step before the final finetuning task. This step helps cover the target domain vocabulary and improves the model performance on the downstream task. In this work, we study the impact of training only the embedding layer on the model's performance during TAPT and task-specific finetuning. Based on our study, we propose a simple approach to make the intermediate step of TAPT for BERT-based models more efficient by performing selective pre-training of BERT layers. We show that training only the BERT embedding layer during TAPT is sufficient to adapt to the vocabulary of the target domain and achieve comparable performance. Our approach is computationally efficient, with 78\% fewer parameters trained during TAPT. The proposed embedding layer finetuning approach can also be an efficient domain adaptation technique.

machine learning, natural language, task-adaptive pre-training, (20 more...)

arXiv.org Artificial Intelligence

2209.12943

Country:

Asia > India > Maharashtra > Pune (0.05)
Asia > India > Tamil Nadu > Chennai (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Improving Probabilistic Models in Text Classification via Active Learning

Bosley, Mitchell, Kuzushima, Saki, Enamorado, Ted, Shiraito, Yuki

arXiv.org Artificial IntelligenceSep-26-2022

Social scientists often classify text documents to use the resulting labels as an outcome or a predictor in empirical research. Automated text classification has become a standard tool, since it requires less human coding. However, scholars still need many human-labeled documents to train automated classifiers. To reduce labeling costs, we propose a new algorithm for text classification that combines a probabilistic model with active learning. The probabilistic model uses both labeled and unlabeled data, and active learning concentrates labeling efforts on difficult documents to classify. Our validation study shows that the classification performance of our algorithm is comparable to state-of-the-art methods at a fraction of the computational cost. Moreover, we replicate two recently published articles and reach the same substantive conclusions with only a small proportion of the original labeled data used in those studies. We provide activeText, an open-source software to implement our method.

machine learning, natural language, text classification, (19 more...)

arXiv.org Artificial Intelligence

2202.02629

Country:

Asia > Middle East > Syria (0.14)
Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(11 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Government > Foreign Policy (1.00)
Law > Civil Rights & Constitutional Law (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)
(2 more...)

Add feedback

Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future

Klie, Jan-Christoph, Webber, Bonnie, Gurevych, Iryna

arXiv.org Artificial IntelligenceSep-25-2022

Annotated data is an essential ingredient in natural language processing for training and evaluating machine learning models. It is therefore very desirable for the annotations to be of high quality. Recent work, however, has shown that several popular datasets contain a surprising amount of annotation errors or inconsistencies. To alleviate this issue, many methods for annotation error detection have been devised over the years. While researchers show that their approaches work well on their newly introduced datasets, they rarely compare their methods to previous work or on the same datasets. This raises strong concerns on methods' general performance and makes it difficult to asses their strengths and weaknesses. We therefore reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets for text classification as well as token and span labeling. In addition, we define a uniform evaluation setup including a new formalization of the annotation error detection task, evaluation protocol and general best practices. To facilitate future research and reproducibility, we release our datasets and implementations in an easy-to-use and open source software package.

machine learning, natural language, text classification, (18 more...)

arXiv.org Artificial Intelligence

2206.0228

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Greater London > London > Wimbledon (0.04)
(48 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Education (1.00)
Leisure & Entertainment > Sports (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.67)
(2 more...)

Add feedback

SkIn: Skimming-Intensive Long-Text Classification Using BERT for Medical Corpus

Zhao, Yufeng, Che, Haiying

arXiv.org Artificial IntelligenceSep-24-2022

BERT is a widely used pre-trained model in natural language processing. However, since BERT is quadratic to the text length, the BERT model is difficult to be used directly on the long-text corpus. In some fields, the collected text data may be quite long, such as in the health care field. Therefore, to apply the pre-trained language knowledge of BERT to long text, in this paper, imitating the skimming-intensive reading method used by humans when reading a long paragraph, the Skimming-Intensive Model (SkIn) is proposed. It can dynamically select the critical information in the text so that the sentence input into the BERT-Base model is significantly shortened, which can effectively save the cost of the classification algorithm. Experiments show that the SkIn method has achieved superior accuracy than the baselines on long-text classification datasets in the medical field, while its time and space requirements increase linearly with the text length, alleviating the time and space overflow problem of basic BERT on long-text data.

machine learning, natural language, text classification, (18 more...)

arXiv.org Artificial Intelligence

2209.05741

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.65)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.71)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.66)

Add feedback

IDEA: Interactive DoublE Attentions from Label Embedding for Text Classification

Wang, Ziyuan, Huang, Hailiang, Han, Songqiao

arXiv.org Artificial IntelligenceSep-23-2022

Current text classification methods typically encode the text merely into embedding before a naive or complicated classifier, which ignores the suggestive information contained in the label text. As a matter of fact, humans classify documents primarily based on the semantic meaning of the subcategories. We propose a novel model structure via siamese BERT and interactive double attentions named IDEA ( Interactive DoublE Attentions) to capture the information exchange of text and label names. Interactive double attentions enable the model to exploit the inter-class and intra-class information from coarse to fine, which involves distinguishing among all labels and matching the semantical subclasses of ground truth labels. Our proposed method outperforms the state-of-the-art methods using label texts significantly with more stable results.

machine learning, natural language, text classification, (13 more...)

arXiv.org Artificial Intelligence

2209.11407

Country:

Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Hong Kong (0.04)
(3 more...)

Genre:

Research Report > Promising Solution (0.68)
Research Report > Experimental Study (0.47)

Industry:

Leisure & Entertainment > Sports (0.68)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.88)

Add feedback

SMTCE: A Social Media Text Classification Evaluation Benchmark and BERTology Models for Vietnamese

Nguyen, Luan Thanh, Van Nguyen, Kiet, Nguyen, Ngan Luu-Thuy

arXiv.org Artificial IntelligenceSep-21-2022

Text classification is a typical natural language processing or computational linguistics task with various interesting applications. As the number of users on social media platforms increases, data acceleration promotes emerging studies on Social Media Text Classification (SMTC) or social media text mining on these valuable resources. In contrast to English, Vietnamese, one of the low-resource languages, is still not concentrated on and exploited thoroughly. Inspired by the success of the GLUE, we introduce the Social Media Text Classification Evaluation (SMTCE) benchmark, as a collection of datasets and models across a diverse set of SMTC tasks. With the proposed benchmark, we implement and analyze the effectiveness of a variety of multilingual BERT-based models (mBERT, XLM-R, and DistilmBERT) and monolingual BERT-based models (PhoBERT, viBERT, vELECTRA, and viBERT4news) for tasks in the SMTCE benchmark. Monolingual models outperform multilingual models and achieve state-of-the-art results on all text classification tasks. It provides an objective assessment of multilingual and monolingual BERT-based models on the benchmark, which will benefit future studies about BERTology in the Vietnamese language.

machine learning, natural language, text classification, (18 more...)

arXiv.org Artificial Intelligence

2209.10482

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
Asia > Vietnam > Hanoi > Hanoi (0.04)
(4 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

CINO: A Chinese Minority Pre-trained Language Model

Yang, Ziqing, Xu, Zihang, Cui, Yiming, Wang, Baoxin, Lin, Min, Wu, Dayong, Chen, Zhigang

arXiv.org Artificial IntelligenceSep-20-2022

Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks. It greatly facilitates the applications of natural language processing on low-resource languages. However, there are still some languages that the current multilingual models do not perform well on. In this paper, we propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages. It covers Standard Chinese, Yue Chinese, and six other ethnic minority languages. To evaluate the cross-lingual ability of the multilingual model on ethnic minority languages, we collect documents from Wikipedia and news websites, and construct two text classification datasets, WCM (Wiki-Chinese-Minority) and CMNews (Chinese-Minority-News). We show that CINO notably outperforms the baselines on various classification tasks. The CINO model and the datasets are publicly available at http://cino.hfl-rc.com.

machine learning, natural language, text classification, (19 more...)

arXiv.org Artificial Intelligence

2202.13558

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Asia > China > Hong Kong (0.04)
(14 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback