AITopics | Text Classification

Collaborating Authors

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

News Overviews Instructional Materials AI-Alerts Classics

A Debiased Nearest Neighbors Framework for Multi-Label Text Classification

Cheng, Zifeng, Jiang, Zhiwei, Yin, Yafeng, Chen, Zhaoling, Wang, Cong, Ge, Shiping, Huang, Qiguo, Gu, Qing

arXiv.org Artificial IntelligenceAug-6-2024

Multi-Label Text Classification (MLTC) is a practical yet challenging task that involves assigning multiple non-exclusive labels to each document. Previous studies primarily focus on capturing label correlations to assist label prediction by introducing special labeling schemes, designing specific model structures, or adding auxiliary tasks. Recently, the $k$ Nearest Neighbor ($k$NN) framework has shown promise by retrieving labeled samples as references to mine label co-occurrence information in the embedding space. However, two critical biases, namely embedding alignment bias and confidence estimation bias, are often overlooked, adversely affecting prediction performance. In this paper, we introduce a DEbiased Nearest Neighbors (DENN) framework for MLTC, specifically designed to mitigate these biases. To address embedding alignment bias, we propose a debiased contrastive learning strategy, enhancing neighbor consistency on label co-occurrence. For confidence estimation bias, we present a debiased confidence estimation strategy, improving the adaptive combination of predictions from $k$NN and inductive binary classifications. Extensive experiments conducted on four public benchmark datasets (i.e., AAPD, RCV1-V2, Amazon-531, and EUR-LEX57K) showcase the effectiveness of our proposed method. Besides, our method does not introduce any extra parameters.

contrastive learning, dataset, prediction, (12 more...)

arXiv.org Artificial Intelligence

2408.03202

Country: Asia > China > Jiangsu Province > Nanjing (0.05)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (0.83)
(3 more...)

Add feedback

Optimal and efficient text counterfactuals using Graph Neural Networks

Lymperopoulos, Dimitris, Lymperaiou, Maria, Filandrianos, Giorgos, Stamou, Giorgos

arXiv.org Artificial IntelligenceAug-4-2024

As NLP models become increasingly integral to decision-making processes, the need for explainability and interpretability has become paramount. In this work, we propose a framework that achieves the aforementioned by generating semantically edited inputs, known as counterfactual interventions, which change the model prediction, thus providing a form of counterfactual explanations for the model. We test our framework on two NLP tasks - binary sentiment classification and topic classification - and show that the generated edits are contrastive, fluent and minimal, while the whole process remains significantly faster that other state-of-the-art counterfactual editors.

computational linguistic, node, substitution, (16 more...)

arXiv.org Artificial Intelligence

2408.01969

Country:

North America > Canada > Ontario > Toronto (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Oregon (0.04)
(11 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.82)
(2 more...)

Add feedback

CERT-ED: Certifiably Robust Text Classification for Edit Distance

Huang, Zhuoqun, Marchant, Neil G, Ohrimenko, Olga, Rubinstein, Benjamin I. P.

arXiv.org Artificial IntelligenceAug-1-2024

With the growing integration of AI in daily life, ensuring the robustness of systems to inference-time attacks is crucial. Among the approaches for certifying robustness to such adversarial examples, randomized smoothing has emerged as highly promising due to its nature as a wrapper around arbitrary black-box models. Previous work on randomized smoothing in natural language processing has primarily focused on specific subsets of edit distance operations, such as synonym substitution or word insertion, without exploring the certification of all edit operations. In this paper, we adapt Randomized Deletion (Huang et al., 2023) and propose, CERTified Edit Distance defense (CERT-ED) for natural language classification. Through comprehensive experiments, we demonstrate that CERT-ED outperforms the existing Hamming distance method RanMASK (Zeng et al., 2023) in 4 out of 5 datasets in terms of both accuracy and the cardinality of the certificate. By covering various threat models, including 5 direct and 5 transfer attacks, our method improves empirical robustness in 38 out of 50 settings.

certificate, dataset, ranmask, (15 more...)

arXiv.org Artificial Intelligence

2408.00728

Country:

Asia > China > Hong Kong (0.04)
North America > Dominican Republic (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(8 more...)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.82)

Add feedback

EuroCropsML: A Time Series Benchmark Dataset For Few-Shot Crop Type Classification

Reuss, Joana, Macdonald, Jan, Becker, Simon, Richter, Lorenz, Körner, Marco

arXiv.org Artificial IntelligenceJul-24-2024

We introduce EuroCropsML, an analysis-ready remote sensing machine learning dataset for time series crop type classification of agricultural parcels in Europe. It is the first dataset designed to benchmark transnational few-shot crop type classification algorithms that supports advancements in algorithmic development and research comparability. It comprises 706 683 multi-class labeled data points across 176 classes, featuring annual time series of per-parcel median pixel values from Sentinel-2 L1C data for 2021, along with crop type labels and spatial coordinates. Based on the open-source EuroCrops collection, EuroCropsML is publicly available on Zenodo.

dataset, method section, parcel, (13 more...)

arXiv.org Artificial Intelligence

2407.17458

Country:

Europe > Estonia (0.08)
Europe > Portugal (0.07)
Europe > Latvia (0.07)
(5 more...)

Genre: Research Report (0.50)

Industry:

Food & Agriculture > Agriculture (0.93)
Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.82)

Add feedback

A Novel Two-Step Fine-Tuning Pipeline for Cold-Start Active Learning in Text Classification Tasks

Belém, Fabiano, Cunha, Washington, França, Celso, Andrade, Claudio, Rocha, Leonardo, Gonçalves, Marcos André

arXiv.org Artificial IntelligenceJul-24-2024

This is the first work to investigate the effectiveness of BERT-based contextual embeddings in active learning (AL) tasks on cold-start scenarios, where traditional fine-tuning is infeasible due to the absence of labeled data. Our primary contribution is the proposal of a more robust fine-tuning pipeline - DoTCAL - that diminishes the reliance on labeled data in AL using two steps: (1) fully leveraging unlabeled data through domain adaptation of the embeddings via masked language modeling and (2) further adjusting model weights using labeled data selected by AL. Our evaluation contrasts BERT-based embeddings with other prevalent text representation paradigms, including Bag of Words (BoW), Latent Semantic Indexing (LSI), and FastText, at two critical stages of the AL process: instance selection and classification. Experiments conducted on eight ATC benchmarks with varying AL budgets (number of labeled instances) and number of instances (about 5,000 to 300,000) demonstrate DoTCAL's superior effectiveness, achieving up to a 33% improvement in Macro-F1 while reducing labeling efforts by half compared to the traditional one-step method. We also found that in several tasks, BoW and LSI (due to information aggregation) produce results superior (up to 59% ) to BERT, especially in low-budget scenarios and hard-to-classify tasks, which is quite surprising.

dataset, effectiveness, representation, (12 more...)

arXiv.org Artificial Intelligence

2407.17284

Country:

South America > Brazil > Minas Gerais (0.05)
Asia > China (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre:

Research Report > New Finding (1.00)
Workflow (0.87)
Research Report > Experimental Study (0.68)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.66)
(2 more...)

Add feedback

Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data

Schelb, Julian, Ulloa, Roberto, Spitz, Andreas

arXiv.org Artificial IntelligenceJul-23-2024

Researchers in the political and social sciences often rely on classification models to analyze trends in information consumption by examining browsing histories of millions of webpages. Automated scalable methods are necessary due to the impracticality of manual labeling. In this paper, we model the detection of topic-related content as a binary classification task and compare the accuracy of fine-tuned pre-trained encoder models against in-context learning strategies. Using only a few hundred annotated data points per topic, we detect content related to three German policies in a database of scraped webpages. We compare multilingual and monolingual models, as well as zero and few-shot approaches, and investigate the impact of negative sampling strategies and the combination of URL & content-based features. Our results show that a small sample of annotated data is sufficient to train an effective classifier. Fine-tuning encoder-based models yields better results than in-context learning. Classifiers using both URL & content-based features perform best, while using URLs alone provides adequate results when content is unavailable.

classification, classifier, webpage, (15 more...)

arXiv.org Artificial Intelligence

2407.16516

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Singapore (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(15 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry:

Law (0.68)
Information Technology (0.67)
Energy > Renewable (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(5 more...)

Add feedback

Automatic Classification of News Subjects in Broadcast News: Application to a Gender Bias Representation Analysis

Pelloin, Valentin, Dodson, Lena, Chapuis, Émile, Hervé, Nicolas, Doukhan, David

arXiv.org Artificial IntelligenceJul-19-2024

This paper introduces a computational framework designed to delineate gender distribution biases in topics covered by French TV and radio news. We transcribe a dataset of 11.7k hours, broadcasted in 2023 on 21 French channels. A Large Language Model (LLM) is used in few-shot conversation mode to obtain a topic classification on those transcriptions. Using the generated LLM annotations, we explore the finetuning of a specialized smaller classification model, to reduce the computational cost. To evaluate the performances of these models, we construct and annotate a dataset of 804 dialogues. This dataset is made available free of charge for research purposes. We show that women are notably underrepresented in subjects such as sports, politics and conflicts. Conversely, on topics such as weather, commercials and health, women have more speaking time than their overall average across all subjects. We also observe representations differences between private and public service channels.

category, dataset, dialogue, (14 more...)

arXiv.org Artificial Intelligence

2407.1418

Country:

North America > Dominican Republic (0.04)
Europe > Ireland (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry: Media > News (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.85)

Add feedback

dzStance at StanceEval2024: Arabic Stance Detection based on Sentence Transformers

Lichouri, Mohamed, Lounnas, Khaled, Ouaras, Khelil Rafik, Abi, Mohamed, Guechtouli, Anis

arXiv.org Artificial IntelligenceJul-18-2024

This study compares Term Frequency-Inverse Document Frequency (TF-IDF) features with Sentence Transformers for detecting writers' stances--favorable, opposing, or neutral--towards three significant topics: COVID-19 vaccine, digital transformation, and women empowerment. Through empirical evaluation, we demonstrate that Sentence Transformers outperform TF-IDF features across various experimental setups. Our team, dzStance, participated in a stance detection competition, achieving the 13th position (74.91%) among 15 teams in Women Empowerment, 10th (73.43%) in COVID Vaccine, and 12th (66.97%) in Digital Transformation. Overall, our team's performance ranked 13th (71.77%) among all participants. Notably, our approach achieved promising F1-scores, highlighting its effectiveness in identifying writers' stances on diverse topics. These results underscore the potential of Sentence Transformers to enhance stance detection models for addressing critical societal issues.

detection, sentence transformer, stance detection, (14 more...)

arXiv.org Artificial Intelligence

2407.13603

Country:

Africa > Middle East > Algeria > Algiers Province > Algiers (0.05)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.04)

Genre: Research Report > New Finding (0.50)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.49)

Add feedback

Decoding AI and Human Authorship: Nuances Revealed Through NLP and Statistical Analysis

Akinwande, Mayowa, Adeliyi, Oluwaseyi, Yussuph, Toyyibat

arXiv.org Artificial IntelligenceJul-15-2024

This research explores the nuanced differences in texts produced by AI and those written by humans, aiming to elucidate how language is expressed differently by AI and humans. Through comprehensive statistical data analysis, the study investigates various linguistic traits, patterns of creativity, and potential biases inherent in human-written and AI- generated texts. The significance of this research lies in its contribution to understanding AI's creative capabilities and its impact on literature, communication, and societal frameworks. By examining a meticulously curated dataset comprising 500K essays spanning diverse topics and genres, generated by LLMs, or written by humans, the study uncovers the deeper layers of linguistic expression and provides insights into the cognitive processes underlying both AI and human-driven textual compositions. The analysis revealed that human-authored essays tend to have a higher total word count on average than AI-generated essays but have a shorter average word length compared to AI- generated essays, and while both groups exhibit high levels of fluency, the vocabulary diversity of Human authored content is higher than AI generated content. However, AI- generated essays show a slightly higher level of novelty, suggesting the potential for generating more original content through AI systems. The paper addresses challenges in assessing the language generation capabilities of AI models and emphasizes the importance of datasets that reflect the complexities of human-AI collaborative writing. Through systematic preprocessing and rigorous statistical analysis, this study offers valuable insights into the evolving landscape of AI-generated content and informs future developments in natural language processing (NLP).

cybernetic & informatic, human-authored essay, international journal, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.5121/ijci.2024.130408

2408.00769

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Arkansas (0.04)
Asia > China > Shanghai > Shanghai (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Banking & Finance (1.00)
Education > Educational Setting > Higher Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(7 more...)

Add feedback

Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification

Liu, Tengfei, Hu, Yongli, Gao, Junbin, Sun, Yanfeng, Yin, Baocai

arXiv.org Artificial IntelligenceJul-14-2024

Long Document Classification (LDC) has gained significant attention recently. However, multi-modal data in long documents such as texts and images are not being effectively utilized. Prior studies in this area have attempted to integrate texts and images in document-related tasks, but they have only focused on short text sequences and images of pages. How to classify long documents with hierarchical structure texts and embedding images is a new problem and faces multi-modal representation difficulties. In this paper, we propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification. The HMT conducts multi-modal feature interaction and fusion between images and texts in a hierarchical manner. Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships between image features, and the section and sentence features. Furthermore, we introduce a new interaction strategy called the dynamic mask transfer module to integrate these two transformers by propagating features between them. To validate our approach, we conduct cross-modal LDC experiments on two newly created and two publicly available multi-modal long document datasets, and the results show that the proposed HMT outperforms state-of-the-art single-modality and multi-modality methods.

long document, proceedings, transformer, (15 more...)

arXiv.org Artificial Intelligence

2407.10105

Country:

Asia > China > Beijing > Beijing (0.04)
Oceania > Australia (0.04)

Genre:

Research Report > New Finding (0.48)
Research Report > Promising Solution (0.34)

Industry: Energy (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.92)

Add feedback