AITopics

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.64)

Nguyen, Quang Anh, Tomeh, Nadi, Lebbah, Mustapha, Charnois, Thierry, Azzag, Hanene, Muñoz, Santiago Cordoba

Manual Verbalizer Enrichment for Few-Shot Text Classification

arXiv.org Artificial IntelligenceOct-8-2024

With the continuous development of pre-trained language models, prompt-based training becomes a well-adopted paradigm that drastically improves the exploitation of models for many natural language processing tasks. Prompting also shows great performance compared to traditional fine-tuning when adapted to zero-shot or few-shot scenarios where the number of annotated data is limited. In this framework, the role of verbalizers is essential, as an interpretation from masked word distributions into output predictions. In this work, we propose \acrshort{mave}, an approach for verbalizer construction by enrichment of class labels using neighborhood relation in the embedding space of words for the text classification task. In addition, we elaborate a benchmarking procedure to evaluate typical baselines of verbalizers for document classification in few-shot learning contexts. Our model achieves state-of-the-art results while using significantly fewer resources. We show that our approach is particularly effective in cases with extremely limited supervision data.

artificial intelligence, natural language, text classification, (18 more...)

2410.06173

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(8 more...)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Sports (0.68)
Media (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)

Neural Information Processing SystemsOct-7-2024, 14:11:28 GMT

Reviews: Aggressive Sampling for Multi-class to Binary Reduction with Applications to Text Classification

Summary: This paper proposes a new reduction from multi-class classification to binary classification that is especially suitable when the number of classes is very large. They consider a hypothesis that map (input,class) pairs to scores, and the underlying loss function counts the fraction of the wrong classes that are scored higher than the true class. More specifically, they suppose they have a feature transformation phi that maps (input,class) pairs to a p-dimensional feature space, and they learn a mapping from R p to scores. Their reduction extends the work of Joshi et al. (2015) which, for each data point (x,y), creates K-1 transformed points where each transformed point intuitively corresponds to the comparison of label y with some incorrect label y'. Given that the transformed dataset contains correlated training examples, many standard generalization bounds cannot be applied.

generalization, reduction, text classification, (9 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.58)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.40)

Neural Information Processing SystemsOct-7-2024, 06:36:35 GMT

Reviews: Diffusion Maps for Textual Network Embedding

The main idea of this paper is to use the diffusion convolutional operator to learn text embedding that takes into account the global influence of the whole graph. It then incorporates the diffusion process in the loss function to capture high-order proximity. In contrast, previous works either neglect the semantic distance indicated from the graph, or fails to take into account the similarities of context influenced by global structural information. The author then conducts experiments on the task of multi-label classification of text and link prediction and shows that the proposed model outperforms the baselines. Strength: The high level idea of of this paper is good, and the method is novel.

diffusion convolution operator, equation, textual network embedding, (11 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.52)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.38)

Azeemi, Abdul Hameed, Qazi, Ihsan Ayyub, Raza, Agha Ali

Language Model-Driven Data Pruning Enables Efficient Active Learning

arXiv.org Artificial IntelligenceOct-5-2024

Active learning (AL) optimizes data labeling efficiency by selecting the most informative instances for annotation. A key component in this procedure is an acquisition function that guides the selection process and identifies the suitable instances for labeling from the unlabeled pool. However, these acquisition methods suffer from high computational costs with large unlabeled data pools, posing a roadblock to their applicability on large datasets. To address this challenge and bridge this gap, we introduce a novel plug-and-play unlabeled data pruning strategy, ActivePrune, which leverages language models to prune the unlabeled pool. ActivePrune implements a two-stage pruning process: an initial fast evaluation using perplexity scores from an n-gram language model, followed by a high-quality selection using metrics for data quality computed through a quantized LLM. Additionally, to enhance the diversity in the unlabeled pool, we propose a novel perplexity reweighting method that systematically brings forward underrepresented instances for selection in subsequent labeling iterations. Experiments on translation, sentiment analysis, topic classification, and summarization tasks on four diverse datasets and four active learning strategies demonstrate that ActivePrune outperforms existing data pruning methods. Finally, we compare the selection quality $\leftrightarrow$ efficiency tradeoff of the data pruning methods and demonstrate that ActivePrune is computationally more efficient than other LLM score-based pruning methods, and provides up to 74% reduction in the end-to-end time required for active learning.

large language model, machine learning, natural language, (17 more...)

2410.04275

Country:

North America > United States (0.46)
Asia (0.28)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.75)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)
(2 more...)

Antypas, Dimosthenis, Ushio, Asahi, Barbieri, Francesco, Camacho-Collados, Jose

Multilingual Topic Classification in X: Dataset and Analysis

arXiv.org Artificial IntelligenceOct-3-2024

In the dynamic realm of social media, diverse topics are discussed daily, transcending linguistic boundaries. However, the complexities of understanding and categorising this content across various languages remain an important challenge with traditional techniques like topic modelling often struggling to accommodate this multilingual diversity. In this paper, we introduce X-Topic, a multilingual dataset featuring content in four distinct languages (English, Spanish, Japanese, and Greek), crafted for the purpose of tweet topic classification. Our dataset includes a wide range of topics, tailored for social media content, making it a valuable resource for scientists and professionals working on cross-linguistic analysis, the development of robust multilingual models, and computational scientists studying online dialogue. Finally, we leverage X-Topic to perform a comprehensive cross-linguistic and multilingual analysis, and compare the capabilities of current general- and domain-specific language models.

large language model, machine learning, natural language, (20 more...)

2410.03075

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(7 more...)

Genre: Research Report (1.00)

Industry:

Media (0.93)
Health & Medicine (0.68)
Information Technology > Services (0.67)
Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.71)

Roy, Sudipta Singha, Wang, Xindi, Mercer, Robert E., Rudzicz, Frank

Graph-tree Fusion Model with Bidirectional Information Propagation for Long Document Classification

arXiv.org Artificial IntelligenceOct-3-2024

Long document classification presents challenges in capturing both local and global dependencies due to their extensive content and complex structure. Existing methods often struggle with token limits and fail to adequately model hierarchical relationships within documents. To address these constraints, we propose a novel model leveraging a graph-tree structure. Our approach integrates syntax trees for sentence encodings and document graphs for document encodings, which capture fine-grained syntactic relationships and broader document contexts, respectively. We use Tree Transformers to generate sentence encodings, while a graph attention network models inter- and intra-sentence dependencies. During training, we implement bidirectional information propagation from word-to-sentence-to-document and vice versa, which enriches the contextual representation. Our proposed method enables a comprehensive understanding of content at all hierarchical levels and effectively handles arbitrarily long contexts without token limit constraints. Experimental results demonstrate the effectiveness of our approach in all types of long document classification tasks.

classification, large language model, machine learning, (19 more...)

2410.0293

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Canada > Ontario > Toronto (0.14)
(12 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Neural Information Processing SystemsOct-2-2024, 19:55:35 GMT

Aggressive Sampling for Multi-class to Binary Reduction with Applications to Text Classification

Bikash Joshi, Massih R. Amini, Ioannis Partalas, Franck Iutzeler, Yury Maximov

We address the problem of multi-class classification in the case where the number of classes is very large. We propose a double sampling strategy on top of a multi-class to binary reduction strategy, which transforms the original multi-class problem into a binary classification problem over pairs of examples. The aim of the sampling strategy is to overcome the curse of long-tailed class distributions exhibited in majority of large-scale multi-class classification problems and to reduce the number of pairs of examples in the expanded data. We show that this strategy does not alter the consistency of the empirical risk minimization principle defined over the double sample reduction. Experiments are carried out on DMOZ and Wikipedia collections with 10,000 to 100,000 classes where we show the efficiency of the proposed approach in terms of training and prediction time, memory consumption, and predictive performance with respect to state-of-the-art approaches.

classification, dataset, predictive performance, (15 more...)

Country:

Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.05)
North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(5 more...)

Genre: Research Report (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Li, Zhijian, Larson, Stefan, Leach, Kevin

Document Type Classification using File Names

arXiv.org Artificial IntelligenceOct-1-2024

Rapid document classification is critical in several time-sensitive applications like digital forensics and large-scale media classification. Traditional approaches that rely on heavy-duty deep learning models fall short due to high inference times over vast input datasets and computational resources associated with analyzing whole documents. In this paper, we present a method using lightweight supervised learning models, combined with a TF-IDF feature extraction-based tokenization method, to accurately and efficiently classify documents based solely on file names that substantially reduces inference time. This approach can distinguish ambiguous file names from the indicative file names through confidence scores and through using a negative class representing ambiguous file names. Our results indicate that file name classifiers can process more than 80% of the in-scope data with 96.7% accuracy when tested on a dataset with a large portion of out-of-scope data with respect to the training dataset while being 442.43x faster than more complex models such as DiT. Our method offers a crucial solution for efficiently processing vast datasets in critical scenarios, enabling fast, more reliable document classification.

classifier, dataset, file name, (15 more...)

2410.01166

Genre: Research Report > New Finding (0.88)

Industry:

Information Technology > Security & Privacy (0.48)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Waldis, Andreas, Birrer, Joel, Lauscher, Anne, Gurevych, Iryna

The Lou Dataset -- Exploring the Impact of Gender-Fair Language in German Text Classification

arXiv.org Artificial IntelligenceSep-26-2024

Nevertheless, there is a significant lack of resources to assess the impact of this linguistic shift on classification using language models (LMs), which are probably not trained on such variations. To address this gap, we present Lou, the first dataset featuring high-quality reformulations for German text classification covering seven tasks, like stance detection and toxicity classification. Evaluating 16 mono-and multi-lingual LMs on Lou shows that genderfair language substantially impacts predictions by flipping labels, reducing certainty, and altering attention patterns. However, existing evaluations remain valid, as LM rankings of Figure 1: A German stance detection instance from the original and reformulated instances do not significantly Lou dataset. We reformulate the masculine formulation differ. While we offer initial insights Konsumenten (consumers) regarding six inclusive or on the effect on German text classification, the neutral strategies, highlighted in yellow. Translation: findings likely apply to other languages, as consistent Consumers must be well supported.

computational linguistic, gender-fair language, reformulation, (12 more...)

2409.17929

Country:

North America > United States > Washington > King County > Seattle (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
(18 more...)

Genre: Research Report > New Finding (0.46)

Industry: Government > Regional Government > Europe Government (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)