AITopics | Text Classification

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

Leveraging BERT Language Model for Arabic Long Document Classification

AL-Qurishi, Muhammad

arXiv.org Artificial Intelligence, May-4-2023

Given the number of Arabic speakers worldwide and the notably large amount of content on the web today in fields such as law, medicine, or even news, documents of considerable length are produced regularly. Classifying those documents using traditional learning models is often impractical, since the extended length of the documents increases computational requirements to an unsustainable level. Thus, it is necessary to customize these models specifically for long textual documents. In this paper we propose two simple but effective models to classify long Arabic documents. We also fine-tune two different models, namely Longformer and RoBERT, for the same task and compare their results to our models. Both of our models outperform the Longformer and RoBERT in this task over two different datasets.

machine learning, natural language, text classification, (16 more...)

2305.03519

Country:

Asia > Middle East > Saudi Arabia > Riyadh Province > Riyadh (0.04)
Asia > China (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.55)

Tuning Traditional Language Processing Approaches for Pashto Text Classification

Baktash, Jawid Ahmad, Dawodi, Mursal, Joya, Mohammad Zarif, Hassanzada, Nematullah

arXiv.org Artificial Intelligence, May-4-2023

Today, text classification has become a critical task for numerous purposes. Hence, several studies have been conducted to develop automatic text classification for national and international languages. However, an automatic text categorization system for local languages is still needed. The main aim of this study is to establish a Pashto automatic text classification system. To this end, we built a Pashto corpus, a collection of Pashto documents, because no public datasets of Pashto text documents were available. In addition, this study compares several statistical and neural network machine learning techniques, including Multilayer Perceptron (MLP), Support Vector Machine (SVM), K Nearest Neighbor (KNN), decision tree, Gaussian naïve Bayes, multinomial naïve Bayes, random forest, and logistic regression, to discover the most effective approach. Moreover, this investigation evaluates two feature extraction methods: unigram and Term Frequency Inverse Document Frequency (TFIDF). The study obtained an average testing accuracy of 94% using the MLP classification algorithm with the TFIDF feature extraction method.
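
The TFIDF weighting named in this abstract can be illustrated with a minimal sketch. It assumes the plain, unsmoothed variant (term frequency times log(N / document frequency)) and uses toy English sentences; the function name `tfidf` and the sample data are hypothetical, and this is not the authors' pipeline. Library implementations typically apply smoothing and normalization on top of this.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical); each document is a list of unigram tokens.
docs = [
    "the match ended in a draw".split(),
    "the market ended lower".split(),
    "the team won the match".split(),
]
vectors = tfidf(docs)

# A term found in every document gets weight 0; rarer terms get more weight.
for term in ("the", "ended", "market"):
    print(term, round(vectors[1][term], 4))
```

The resulting sparse vectors are the kind of feature representation that could then be passed to any of the classifiers listed in the abstract.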

machine learning, natural language, text classification, (17 more...)

doi: 10.5121/ijci.2023.120222

2305.03737

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > Germany (0.04)
(10 more...)

Genre: Research Report > Experimental Study (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
(3 more...)

Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering

Wu, Zhiyong, Wang, Yaoxiang, Ye, Jiacheng, Kong, Lingpeng

arXiv.org Artificial Intelligence, May-3-2023

Despite the surprising few-shot performance of in-context learning (ICL), it is still a common practice to randomly sample examples to serve as context. This paper advocates a new principle for ICL: self-adaptive in-context learning. The self-adaption mechanism is introduced to help each sample find an in-context example permutation (i.e., selection and ordering) that can derive the correct prediction, thus maximizing performance. To validate the effectiveness of self-adaptive ICL, we propose a general select-then-rank framework and instantiate it with new selection and ranking algorithms. Upon extensive evaluation on eight different NLP datasets, our self-adaptive ICL method achieves a 40% relative improvement over the common practice setting. Further analysis reveals the enormous potential of self-adaptive ICL that it might be able to close the gap between ICL and finetuning given more advanced algorithms. Our code is released to facilitate future research in this area: https://github.com/Shark-NLP/self-adaptive-ICL

large language model, machine learning, natural language, (17 more...)

2212.10375

Country:

Asia > China > Shanghai > Shanghai (0.04)
North America > United States > Washington > King County > Seattle (0.04)
Asia > Middle East > Jordan (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.47)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Psychologically-Inspired Causal Prompts

Lyu, Zhiheng, Jin, Zhijing, Mattern, Justus, Mihalcea, Rada, Sachan, Mrinmaya, Schoelkopf, Bernhard

arXiv.org Artificial Intelligence, May-2-2023

NLP datasets are richer than just input-output pairs; rather, they carry causal relations between the input and output variables. In this work, we take sentiment classification as an example and look into the causal relations between the review (X) and sentiment (Y). As psychology studies show that language can affect emotion, different psychological processes are evoked when a person first makes a rating and then self-rationalizes their feeling in a review (where the sentiment causes the review, i.e., Y -> X), versus first describes their experience, and weighs the pros and cons to give a final rating (where the review causes the sentiment, i.e., X -> Y). Furthermore, it is also a completely different psychological process if an annotator infers the original rating of the user by theory of mind (ToM) (where the review causes the rating, i.e., X -ToM-> Y). In this paper, we verbalize these three causal mechanisms of human psychological processes of sentiment classification into three different causal prompts, and study (1) how differently they perform, and (2) what nature of sentiment classification data leads to agreement or diversity in the model responses elicited by the prompts. We suggest future work raise awareness of different causal structures in NLP tasks. Our code and data are at https://github.com/cogito233/psych-causal-prompt

large language model, natural language, text classification, (18 more...)

2305.01764

Country:

Europe > Switzerland > Zürich > Zürich (0.04)
North America > United States > Michigan (0.04)
Asia > China > Hong Kong (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry:

Health & Medicine (0.93)
Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.89)

Graph Neural Networks for Text Classification: A Survey

Wang, Kunze, Ding, Yihao, Han, Soyeon Caren

arXiv.org Artificial Intelligence, Apr-27-2023

Text Classification is the most essential and fundamental problem in Natural Language Processing. While numerous recent text classification models applied the sequential deep learning technique, graph neural network-based models can directly deal with complex structured text data and exploit global information. Many real text classification applications can be naturally cast into a graph, which captures words, documents, and corpus global features. In this survey, we bring the coverage of methods up to 2023, including corpus-level and document-level graph neural networks. We discuss each of these methods in detail, dealing with the graph construction mechanisms and the graph-based learning process. As well as the technological survey, we look at issues behind and future directions addressed in text classification using graph neural networks. We also cover datasets, evaluation metrics, and experiment design and present a summary of published performance on the publicly available benchmarks. Note that we present a comprehensive comparison between different techniques and identify the pros and cons of various evaluation metrics in this survey.

classification, machine learning, natural language, (15 more...)

2304.11534

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
Asia > Middle East > Jordan (0.04)
Oceania > Australia > Western Australia (0.04)
(10 more...)

Genre: Overview (1.00)

Industry:

Information Technology (0.67)
Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Is augmentation effective to improve prediction in imbalanced text datasets?

Assunção, Gabriel O., Izbicki, Rafael, Prates, Marcos O.

arXiv.org Artificial Intelligence, Apr-20-2023

Imbalanced datasets present a significant challenge for machine learning models, often leading to biased predictions. To address this issue, data augmentation techniques are widely used in natural language processing (NLP) to generate new samples for the minority class. However, in this paper, we challenge the common assumption that data augmentation is always necessary to improve predictions on imbalanced datasets. Instead, we argue that adjusting the classifier cutoffs without data augmentation can produce similar results to oversampling techniques. Our study provides theoretical and empirical evidence to support this claim. Our findings contribute to a better understanding of the strengths and limitations of different approaches to dealing with imbalanced data, and help researchers and practitioners make informed decisions about which methods to use for a given task.
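
To make "adjusting the classifier cutoffs" concrete, here is a minimal sketch that picks the decision threshold maximizing F1 for the minority class instead of using the default 0.5. The scores, labels, and function names are hypothetical, and F1 is an assumed selection criterion; the sketch does not reproduce the paper's experiments.

```python
def f1_at_cutoff(scores, labels, cutoff):
    """F1 of the positive (minority) class when predicting 1 for score >= cutoff."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= cutoff and y == 1)
    fp = sum(1 for s, y in pairs if s >= cutoff and y == 0)
    fn = sum(1 for s, y in pairs if s < cutoff and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_cutoff(scores, labels):
    """Try every observed score as a cutoff and keep the one with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda c: f1_at_cutoff(scores, labels, c))


# Hypothetical predicted probabilities from a classifier trained on
# imbalanced data (3 positives out of 10), with the true labels.
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.45, 0.60]
labels = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1]

cutoff = best_cutoff(scores, labels)
print("default 0.5 F1:", f1_at_cutoff(scores, labels, 0.5))
print("tuned cutoff:", cutoff, "F1:", f1_at_cutoff(scores, labels, cutoff))
```

In practice the cutoff would be chosen on a held-out validation set rather than on the data used for the final evaluation.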

machine learning, natural language, text classification, (17 more...)

2304.10283

Country:

North America > United States > Iowa (0.05)
South America > Brazil > Minas Gerais > Belo Horizonte (0.04)
North America > United States > California (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)

A Two-Stage Framework with Self-Supervised Distillation For Cross-Domain Text Classification

Feng, Yunlong, Li, Bohan, Qin, Libo, Xu, Xiao, Che, Wanxiang

arXiv.org Artificial Intelligence, Apr-18-2023

Cross-domain text classification aims to adapt models to a target domain that lacks labeled data. It leverages or reuses rich labeled data from the different but related source domain(s) and unlabeled data from the target domain. To this end, previous work focuses on either extracting domain-invariant features or task-agnostic features, ignoring domain-aware features that may be present in the target domain and could be useful for the downstream task. In this paper, we propose a two-stage framework for cross-domain text classification. In the first stage, we fine-tune the model with masked language modeling (MLM) and labeled data from the source domain. In the second stage, we further fine-tune the model with self-supervised distillation (SSD) and unlabeled data from the target domain. We evaluate its performance on a public cross-domain text classification benchmark and the experiment results show that our method achieves new state-of-the-art results for both single-source domain adaptations (94.17%, ↑1.03%) and multi-source domain adaptations (95.09%, ↑1.34%).

machine learning, natural language, text classification, (15 more...)

2304.0982

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
North America > Dominican Republic (0.04)
(6 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Label Dependencies-aware Set Prediction Networks for Multi-label Text Classification

Quanjie, Han, Xinkai, Du, Yalin, Sun, Chao, Lv

arXiv.org Artificial Intelligence, Apr-14-2023

Multi-label text classification aims to extract all the related labels from a sentence, which can be viewed as a sequence generation problem. However, the labels in the training dataset are unordered. We propose to treat it as a direct set prediction problem, which removes the need to consider the order of labels. Besides, in order to model the correlation between labels, the adjacency matrix is constructed through the statistical relations between labels and a GCN is employed to learn the label information. Based on the learned label information, the set prediction networks can utilize both the sentence information and the label information for multi-label text classification simultaneously. Furthermore, the Bhattacharyya distance is imposed on the output probability distributions of the set prediction networks to increase the recall ability. Experimental results on four multi-label datasets show the effectiveness of the proposed method, which outperforms previous methods by a substantial margin.
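
For reference, the Bhattacharyya distance mentioned in this abstract has a standard definition for two discrete distributions p and q: D_B(p, q) = -ln(sum_i sqrt(p_i * q_i)). The sketch below computes only that quantity; the function name and the example distributions are hypothetical, and how the distance enters the paper's training objective is not shown here.

```python
import math


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete probability distributions.

    D_B(p, q) = -ln(BC), where BC = sum_i sqrt(p_i * q_i) is the
    Bhattacharyya coefficient. Both inputs are assumed to sum to 1.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    if bc == 0.0:
        # No overlapping support: the distance is infinite.
        return math.inf
    return -math.log(bc)


# Hypothetical output distributions over three labels.
a = [0.7, 0.2, 0.1]
b = [0.6, 0.3, 0.1]
c = [0.1, 0.2, 0.7]

print("similar:   ", bhattacharyya_distance(a, b))
print("dissimilar:", bhattacharyya_distance(a, c))
```

The distance is near zero for nearly identical distributions and grows as their overlap shrinks.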

classification, machine learning, natural language, (19 more...)

2304.07022

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Addressing contingency in algorithmic (mis)information classification: Toward a responsible machine learning agenda

Hernández, Andrés Domínguez, Owen, Richard, Nielsen, Dan Saattrup, McConville, Ryan

arXiv.org Artificial Intelligence, Apr-13-2023

Machine learning (ML) enabled classification models are becoming increasingly popular for tackling the sheer volume and speed of online misinformation and other content that could be identified as harmful. In building these models, data scientists need to take a stance on the legitimacy, authoritativeness and objectivity of the sources of "truth" used for model training and testing. This has political, ethical and epistemic implications which are rarely addressed in technical papers. Despite (and due to) their reported high accuracy and performance, ML-driven moderation systems have the potential to shape online public debate and create downstream negative impacts such as undue censorship and the reinforcing of false beliefs. Using collaborative ethnography and theoretical insights from social studies of science and expertise, we offer a critical analysis of the process of building ML models for (mis)information classification: we identify a series of algorithmic contingencies (key moments during model development that could lead to different future outcomes, uncertainty and harmful effects) as these tools are deployed by social media platforms. We conclude by offering a tentative path toward reflexive and responsible development of ML tools for moderating misinformation and other harmful content online.

machine learning, natural language, text classification, (17 more...)

doi: 10.1080/23299460.2023.2222514

2210.09014

Country:

North America > United States > New York > New York County > New York City (0.05)
North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > Ukraine (0.04)
(10 more...)

Genre: Research Report (1.00)

Industry:

Media > News (1.00)
Law (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)
Health & Medicine > Therapeutic Area > Immunology (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.60)

Julian Assange's family grills government's 'over-classification' of documents: 'A problem for democracy'

FOX News, Apr-9-2023, 18:00:59 GMT

Fox Nation host Piers Morgan talks to Julian Assange's brother and father about his role in leaking classified military documents. People can never seem to agree on which label to give infamous info leaker Julian Assange. The WikiLeaks founder accused of publishing classified U.S. military information looks at up to 175 years in prison if extradited to the U.S. from his current location in a high-security U.K. prison. His father and brother are among those heralding him as a hero, reiterating their belief in a recent appearance on Fox Nation's "Piers Morgan: Uncensored." "Everything that Julian published was in the public interest and he partnered with these media organizations… so you're talking about all the largest media organizations around the world that published this exact same information," Gabriel Shipton, Assange's brother, said.

assange, natural language, text classification, (9 more...)

Country:

North America > Canada > Ontario > Middlesex County > London (0.26)
North America > United States > Michigan (0.06)

Industry:

Law > Civil Rights & Constitutional Law (1.00)
Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.42)
