AITopics | Text Classification

Text Classification

"A text classifier is an automated means of determining some metadata about a document. Text classifiers are used for such diverse needs as spam filtering, suggesting categories for indexing a document created in a content management system, or automatically sorting help desk requests."
– John Graham-Cumming, Naive Bayesian Text Classification. Dr. Dobb's. May 1 2005.

InsurTech innovation using natural language processing

Dong, Panyi, Quan, Zhiyu

arXiv.org Machine Learning, Jul-30-2025

InsurTech refers to the use of state-of-the-art technology, including both emerging hardware and software, to address inefficiencies across the insurance value chain and further explore new opportunities to reshape traditional business operations. InsurTech encompasses a broad spectrum of technology-driven innovations, including, but not limited to, telematics, usage-based insurance, and the integration of Internet of Things (IoT) sensors. In this study, we focus on a specific class of InsurTech, an InsurTech data vendor, that provides insurance companies with next-generation data solutions. We leverage new and diverse external data sources, such as social media data and online content, to enrich the internal database, thereby empowering actuarial analytics and gaining more accurate insights into risk profiles and policyholder behavior. Specifically, by integrating alternative data sources beyond traditional information, insurance companies can uncover previously unrecognized risk factors, reduce bias in existing features, and identify more accurate risk exposures based on the operational characteristics of the insured entities.

information retrieval, large language model, machine learning, (24 more...)

2507.21112

Country:

North America > United States > California (0.05)
North America > United States > Illinois > Cook County > Chicago (0.05)
North America > United States > New Jersey (0.04)
(3 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Banking & Finance > Insurance (1.00)
Government > Regional Government > North America Government > United States Government (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(7 more...)

Can human clinical rationales improve the performance and explainability of clinical text classification models?

Metzner, Christoph, Gao, Shang, Herrmannova, Drahomira, Hanson, Heidi A.

arXiv.org Artificial Intelligence, Jul-30-2025

AI-driven clinical text classification is vital for explainable automated retrieval of population-level health information. This work investigates whether human-based clinical rationales can serve as additional supervision to improve both performance and explainability of transformer-based models that automatically encode clinical documents. We analyzed 99,125 human-based clinical rationales that provide plausible explanations for primary cancer site diagnoses, using them as additional training samples alongside 128,649 electronic pathology reports to evaluate transformer-based models for extracting primary cancer sites. We also investigated sufficiency as a way to measure rationale quality for pre-selecting rationales. Our results showed that clinical rationales as additional training data can improve model performance in high-resource scenarios but produce inconsistent behavior when resources are limited. Using sufficiency as an automatic metric to preselect rationales also leads to inconsistent results. Importantly, models trained on rationales were consistently outperformed by models trained on additional reports instead. This suggests that clinical rationales don't consistently improve model performance and are outperformed by simply using more reports. Therefore, if the goal is optimizing accuracy, annotation efforts should focus on labeling more reports rather than creating rationales. However, if explainability is the priority, training models on rationale-supplemented data may help them better identify rationale-like features. We conclude that using clinical rationales as additional training data results in smaller performance improvements and only slightly better explainability (measured as average token-level rationale coverage) compared to training on additional reports.

large language model, machine learning, natural language, (21 more...)

2507.21302

Country: North America > United States > Tennessee (0.46)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)

A Survey of Classification Tasks and Approaches for Legal Contracts

Singh, Amrita, Joshi, Aditya, Jiang, Jiaojiao, Paik, Hye-young

arXiv.org Artificial Intelligence, Jul-30-2025

Given the large size and volumes of contracts and their underlying inherent complexity, manual reviews become inefficient and prone to errors, creating a clear need for automation. Automatic Legal Contract Classification (LCC) revolutionizes the way legal contracts are analyzed, offering substantial improvements in speed, accuracy, and accessibility. This survey discusses the challenges of automatic LCC and examines key tasks, datasets, and methodologies in detail. We identify seven classification tasks within LCC, and review fourteen datasets related to English-language contracts, including public, proprietary, and non-public sources. We also introduce a methodology taxonomy for LCC, categorized into Traditional Machine Learning, Deep Learning, and Transformer-based approaches. Additionally, the survey discusses evaluation techniques and highlights the best-performing results from the reviewed studies. By providing a thorough overview of current methods and their limitations, this survey suggests future research directions to improve the efficiency, accuracy, and scalability of LCC. As the first comprehensive survey on LCC, it aims to support legal NLP researchers and practitioners in improving legal processes, making legal information more accessible, and promoting a more informed and equitable society.

classification, large language model, machine learning, (22 more...)

2507.21108

Country:

North America > United States (1.00)
Asia (0.92)
Europe > United Kingdom > England (0.27)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)
Research Report > Experimental Study (0.93)
Research Report > Promising Solution (0.92)

Industry:

Law > Business Law (1.00)
Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Law > Statutes (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(4 more...)

Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs

Lopez-Duran, Miguel, Fierrez, Julian, Morales, Aythami, Tolosana, Ruben, Delgado-Mohatar, Oscar, Ortigosa, Alvaro

arXiv.org Artificial Intelligence, Jul-29-2025

The automatic analysis of document layouts in digital-born PDF documents remains a challenging problem due to the heterogeneous arrangement of textual and nontextual elements and the imprecision of the textual metadata in the Portable Document Format. In this work, we benchmark Graph Neural Network (GNN) architectures for the task of fine-grained layout classification of text blocks from digital native documents. We introduce two graph construction structures: a k-closest-neighbor graph and a fully connected graph, and generate node features via pre-trained text and vision models, thus avoiding manual feature engineering. Three experimental frameworks are evaluated: single-modality (text or visual), concatenated multimodal, and dual-branch multimodal. We evaluated four foundational GNN models and compared them with the baseline. Our experiments are specifically conducted on a rich dataset of public affairs documents that includes more than 20 sources (e.g., regional and national-level official gazettes), 37K PDF documents, with 441K pages in total. Our results demonstrate that GraphSAGE operating on the k-closest-neighbor graph in a dual-branch configuration achieves the highest per-class and overall accuracy, outperforming the baseline in some sources. These findings confirm the importance of local layout relationships and multimodal fusion exploited through GNNs for the analysis of native digital document layouts.
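As a rough illustration of the k-closest-neighbor graph mentioned in the abstract above, the sketch below links each text block to its k nearest blocks. Measuring distance between block centers with Euclidean distance, the function name `knn_edges`, and the sample coordinates are all assumptions made for illustration; the abstract does not say how the authors measure distance between blocks.

```python
import math


def knn_edges(centers, k):
    """Return directed edges (i, j) linking each block i to its k nearest blocks j."""
    edges = []
    for i, (xi, yi) in enumerate(centers):
        # Hypothetical choice: Euclidean distance between block centers.
        dists = [
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(centers)
            if j != i
        ]
        dists.sort()  # nearest first; ties broken by block index
        edges.extend((i, j) for _, j in dists[:k])
    return edges


# Hypothetical centers of four text blocks on one page.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
edges = knn_edges(centers, k=2)
```

With n blocks this produces n * k directed edges, whereas a fully connected graph over the same blocks would have n * (n - 1).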

machine learning, natural language, text classification, (17 more...)

2505.14699

Country:

Europe > Spain (0.68)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.88)

Industry:

Information Technology (0.68)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.87)

Combining Language and Topic Models for Hierarchical Text Classification

Toit, Jaco du, Dunaiski, Marcel

arXiv.org Artificial Intelligence, Jul-23-2025

Hierarchical text classification (HTC) is a natural language processing task which has the objective of categorising text documents into a set of classes from a predefined structured class hierarchy. Recent HTC approaches use various techniques to incorporate the hierarchical class structure information with the natural language understanding capabilities of pre-trained language models (PLMs) to improve classification performance. Furthermore, using topic models along with PLMs to extract features from text documents has been shown to be an effective approach for multi-label text classification tasks. The rationale behind the combination of these feature extractor models is that the PLM captures the finer-grained contextual and semantic information while the topic model obtains high-level representations which consider the corpus of documents as a whole. In this paper, we use an HTC approach which uses a PLM and a topic model to extract features from text documents which are used to train a classification model. Our objective is to determine whether the combination of the features extracted from the two models is beneficial to HTC performance in general. In our approach, the extracted features are passed through separate convolutional layers whose outputs are combined and passed to a label-wise attention mechanism, which obtains label-specific document representations by weighing the most important features for each class separately. We perform comprehensive experiments on three HTC benchmark datasets and show that using the features extracted from the topic model generally decreases classification performance compared to only using the features obtained by the PLM. In contrast to previous work, this shows that the incorporation of features extracted from topic models for text classification tasks should not be assumed beneficial.

machine learning, natural language, text classification, (14 more...)

2507.1649

Country:

Asia (0.93)
North America > United States > California (0.68)
Europe (0.67)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Label-semantics Aware Generative Approach for Domain-Agnostic Multilabel Classification

Khatuya, Subhendu, Naidu, Shashwat, Ghosh, Saptarshi, Goyal, Pawan, Ganguly, Niloy

arXiv.org Artificial Intelligence, Jul-22-2025

The explosion of textual data has made manual document classification increasingly challenging. To address this, we introduce a robust, efficient domain-agnostic generative model framework for multi-label text classification. Instead of treating labels as mere atomic symbols, our approach utilizes predefined label descriptions and is trained to generate these descriptions based on the input text. During inference, the generated descriptions are matched to the pre-defined labels using a finetuned sentence transformer. We integrate this with a dual-objective loss function, combining cross-entropy loss and cosine similarity of the generated sentences with the predefined target descriptions, ensuring both semantic alignment and accuracy. Our proposed model LAGAMC stands out for its parameter efficiency and versatility across diverse datasets, making it well-suited for practical applications. We demonstrate the effectiveness of our proposed model by achieving new state-of-the-art performances across all evaluated datasets, surpassing several strong baselines. We achieve improvements of 13.94% in Micro-F1 and 24.85% in Macro-F1 compared to the closest baseline across all datasets.
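The inference step described in the abstract above, matching a generated description to the closest predefined label description, can be sketched as a nearest-neighbor lookup under cosine similarity. This is only an illustration: the abstract says the authors use a finetuned sentence transformer, while the `embed` function here is a toy bag-of-words stand-in, and the label names and descriptions are invented for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_label(generated, label_descriptions):
    """Return the label whose description is most similar to the generated text."""
    g = embed(generated)
    return max(
        label_descriptions,
        key=lambda label: cosine(g, embed(label_descriptions[label])),
    )


# Hypothetical label descriptions and generated output.
label_descriptions = {
    "sports": "news about games teams and athletes",
    "finance": "news about markets banks and money",
}
predicted = match_label("a story about banks and money markets", label_descriptions)
```

For multi-label use, one would presumably match each generated description separately, or keep every label above a similarity threshold; the abstract does not specify which.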

large language model, machine learning, natural language, (22 more...)

2506.06806

Country:

Asia (0.46)
North America > Mexico (0.28)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine > Therapeutic Area > Immunology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)

Signs of the Past, Patterns of the Present: On the Automatic Classification of Old Babylonian Cuneiform Signs

Verwimp, Eli, Smidt, Gustav Ryberg, Hameeuw, Hendrik, De Graef, Katrien

arXiv.org Artificial Intelligence, Jul-21-2025

The work in this paper describes the training and evaluation of machine learning (ML) techniques for the classification of cuneiform signs. There is a lot of variability in cuneiform signs, depending on where they come from, for what and by whom they were written, but also how they were digitized. This variability makes it unlikely that an ML model trained on one dataset will perform successfully on another dataset. This contribution studies how such differences impact that performance. Based on our results and insights, we aim to influence future data acquisition standards and provide a solid foundation for future cuneiform sign classification tasks. The ML model has been trained and tested on handwritten Old Babylonian (c. 2000-1600 B.C.E.) documentary texts inscribed on clay tablets originating from three Mesopotamian cities (Nippur, Dūr-Abiešuh and Sippar). The presented and analysed model is ResNet50, which achieves a top-1 score of 87.1% and a top-5 score of 96.5% for signs with at least 20 instances. As these automatic classification results are the first on Old Babylonian texts, there are currently no comparable results.
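The top-1 and top-5 scores quoted above follow the usual top-k accuracy definition: a prediction counts as correct when the true class is among the k highest-scoring classes. A minimal sketch of that metric, with made-up scores for a three-class problem (the function name and data are illustrative, not taken from the paper):

```python
def top_k_accuracy(scores, targets, k):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Hypothetical per-class scores for four samples.
scores = [
    [0.7, 0.2, 0.1],  # true class 0: ranked first
    [0.1, 0.3, 0.6],  # true class 1: ranked second
    [0.5, 0.4, 0.1],  # true class 2: ranked last
    [0.2, 0.5, 0.3],  # true class 1: ranked first
]
targets = [0, 1, 2, 1]
top1 = top_k_accuracy(scores, targets, k=1)
top2 = top_k_accuracy(scores, targets, k=2)
```

Top-k accuracy can never decrease as k grows, which is why a top-5 score is always at least as high as the corresponding top-1 score.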

machine learning, natural language, text classification, (19 more...)

2507.13959

Country:

Europe > Belgium > Flanders (0.14)
North America > United States > Illinois (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Political Leaning and Politicalness Classification of Texts

Volf, Matous, Simko, Jakub

arXiv.org Artificial Intelligence, Jul-21-2025

This paper addresses the challenge of automatically classifying text according to political leaning and politicalness using transformer models. We compose a comprehensive overview of existing datasets and models for these tasks, finding that current approaches create siloed solutions that perform poorly on out-of-distribution texts. To address this limitation, we compile a diverse dataset by combining 12 datasets for political leaning classification and creating a new dataset for politicalness by extending 18 existing datasets with the appropriate label. Through extensive benchmarking with leave-one-in and leave-one-out methodologies, we evaluate the performance of existing models and train new ones with enhanced generalization capabilities.
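One plausible reading of the two benchmarking schemes named above, sketched at the level of whole datasets: leave-one-out trains on every dataset except one and tests on the held-out dataset, while leave-one-in trains on a single dataset and tests on all the others. The abstract does not define the terms, so this interpretation, the function names, and the dataset names are assumptions.

```python
def leave_one_out(names):
    """Yield (train, test) splits: train on all datasets but one, test on the held-out one."""
    for held_out in names:
        yield [n for n in names if n != held_out], [held_out]


def leave_one_in(names):
    """Yield (train, test) splits: train on a single dataset, test on all the others."""
    for kept in names:
        yield [kept], [n for n in names if n != kept]


# Hypothetical dataset names.
datasets = ["A", "B", "C"]
loo = list(leave_one_out(datasets))
loi = list(leave_one_in(datasets))
```

Under this reading, both schemes test only on datasets the model never saw during training, which is what makes them a probe of out-of-distribution performance.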

large language model, machine learning, natural language, (20 more...)

2507.13913

Country:

Europe (1.00)
North America > United States > Maryland (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Government (1.00)
Media > News (0.93)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Improving Data and Parameter Efficiency of Neural Language Models Using Representation Analysis

Jukić, Josip

arXiv.org Artificial Intelligence, Jul-17-2025

This thesis addresses challenges related to data and parameter efficiency in neural language models, with a focus on representation analysis and the introduction of new optimization techniques. The first part examines the properties and dynamics of language representations within neural models, emphasizing their significance in enhancing robustness and generalization. It proposes innovative approaches based on representation smoothness, including regularization strategies that utilize Jacobian and Hessian matrices to stabilize training and mitigate sensitivity to input perturbations. The second part focuses on methods to significantly enhance data and parameter efficiency by integrating active learning strategies with parameter-efficient fine-tuning, guided by insights from representation smoothness analysis. It presents smoothness-informed early-stopping techniques designed to eliminate the need for labeled validation sets and proposes innovative combinations of active learning and parameter-efficient fine-tuning to reduce labeling efforts and computational resources. Extensive experimental evaluations across various NLP tasks demonstrate that these combined approaches substantially outperform traditional methods in terms of performance, stability, and efficiency. The third part explores weak supervision techniques enhanced by in-context learning to effectively utilize unlabeled data, further reducing dependence on extensive labeling. It shows that using in-context learning as a mechanism for weak supervision enables models to better generalize from limited labeled data by leveraging unlabeled examples more effectively during training. Comprehensive empirical evaluations confirm significant gains in model accuracy, adaptability, and robustness, especially in low-resource settings and dynamic data environments.

large language model, machine learning, natural language, (27 more...)

2507.12004

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Minnesota (0.27)

Genre:

Research Report > Promising Solution (1.00)
Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.92)
Law (0.67)
Education > Curriculum (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(6 more...)

Enhancing Clinical Text Classification via Fine-Tuned DRAGON Longformer Models

Yang, Mingchuan, Huang, Ziyuan

arXiv.org Artificial Intelligence, Jul-15-2025

This study explores the optimization of the DRAGON Longformer base model for clinical text classification, specifically targeting the binary classification of medical case descriptions. A dataset of 500 clinical cases containing structured medical observations was used, with 400 cases for training and 100 for validation. Enhancements to the pre-trained joeranbosma/dragon-longformer-base-mixed-domain model included hyperparameter tuning, domain-specific preprocessing, and architectural adjustments. Key modifications involved increasing sequence length from 512 to 1024 tokens, adjusting learning rates from 1e-05 to 5e-06, extending training epochs from 5 to 8, and incorporating specialized medical terminology. The optimized model achieved notable performance gains: accuracy improved from 72.0% to 85.2%, precision from 68.0% to 84.1%, recall from 75.0% to 86.3%, and F1-score from 71.0% to 85.2%. Statistical analysis confirmed the significance of these improvements (p < .001). The model demonstrated enhanced capability in interpreting medical terminology, anatomical measurements, and clinical observations. These findings contribute to domain-specific language model research and offer practical implications for clinical natural language processing applications. The optimized model's strong performance across diverse medical conditions underscores its potential for broad use in healthcare settings.

Introduction

Natural language processing (NLP) in healthcare has continued to advance rapidly, revolutionizing the ability to analyze clinical texts and automate the extraction of valuable insights from massive amounts of medical documentation (Khurana, Koli, Khatter, & Singh, 2023). Over the past few years, large language models (LLMs) have emerged as powerful tools for processing and gaining insight from clinical narratives, creating unprecedented capabilities in medical text classification, entity recognition, and clinical decision support (Wang et al., 2018). Among these models, the DRAGON (Deep Representation Analysis for General-domain Ontology Networks) framework is specialized for medical text processing (Bosma et al., 2025). Beltagy, Peters, and Cohan (2020) state that the DRAGON Longformer model, built on top of the Longformer architecture, addresses the quadratic computational complexity issue of traditional transformer models when processing long sequences.
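To make the quadratic-complexity point concrete, the sketch below counts attention score pairs for full self-attention against a sliding-window scheme of the kind Longformer uses. It is a back-of-the-envelope estimate only: the window size of 256 is a hypothetical value chosen for illustration, the windowed count is an upper bound that ignores edge effects, and global-attention tokens are left out.

```python
def full_attention_pairs(n):
    """Score pairs in full self-attention: every token attends to every token."""
    return n * n


def windowed_attention_pairs(n, window):
    """Upper bound on score pairs when each token attends to at most `window` tokens."""
    return n * min(window, n)


# Sequence lengths from the abstract; the window size is a hypothetical value.
for n in (512, 1024):
    full = full_attention_pairs(n)
    windowed = windowed_attention_pairs(n, window=256)
    print(f"n={n}: full={full}, windowed={windowed}")
```

Doubling the sequence length quadruples the full-attention count but only doubles the windowed one, which is the property that makes longer inputs tractable.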

large language model, machine learning, natural language, (21 more...)

2507.0947

Country: North America > United States (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (0.68)
Health & Medicine > Health Care Technology (0.46)
Health & Medicine > Diagnostic Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(3 more...)
