AITopics | sample selection strategy

Collaborating Authors

sample selection strategy

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

An Empirical Study of Sample Selection Strategies for Large Language Model Repair

Li, Xuran, Wang, Jingyi

arXiv.org Artificial IntelligenceOct-24-2025

Large language models (LLMs) are increasingly deployed in real-world systems, yet they can produce toxic or biased outputs that undermine safety and trust. Post-hoc model repair provides a practical remedy, but the high cost of parameter updates motivates selective use of repair data. Despite extensive prior work on data selection for model training, it remains unclear which sampling criteria are most effective and efficient when applied specifically to behavioral repair of large generative models. Our study presents a systematic analysis of sample prioritization strategies for LLM repair. We evaluate five representative selection methods, including random sampling, K-Center, gradient-norm-based selection(GraNd), stratified coverage (CCS), and a Semantic-Aware Prioritized Sampling (SAPS) approach we proposed. Repair effectiveness and trade-offs are assessed through toxicity reduction, perplexity on WikiText-2 and LAMBADA, and three composite metrics: the Repair Proximity Score (RPS), the Overall Performance Score (OPS), and the Repair Efficiency Score (RES). Experimental results show that SAPS achieves the best balance between detoxification, utility preservation, and efficiency, delivering comparable or superior repair outcomes with substantially less data. Random sampling remains effective for large or robust models, while high-overhead methods such as CCS and GraNd provide limited benefit. The optimal data proportion depends on model scale and repair method, indicating that sample selection should be regarded as a tunable component of repair pipelines. Overall, these findings establish selection-based repair as an efficient and scalable paradigm for maintaining LLM reliability.

large language model, machine learning, repair method, (20 more...)

arXiv.org Artificial Intelligence

2510.20428

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Combating Noisy Labels via Dynamic Connection Masking

Zhang, Xinlei, Liu, Fan, Zhang, Chuanyi, Cheng, Fan, Zheng, Yuhui

arXiv.org Artificial IntelligenceOct-2-2025

Noisy labels are inevitable in real-world scenarios. Due to the strong capacity of deep neural networks to memorize corrupted labels, these noisy labels can cause significant performance degradation. Existing research on mitigating the negative effects of noisy labels has mainly focused on robust loss functions and sample selection, with comparatively limited exploration of regularization in model architecture. Inspired by the sparsity regularization used in Kolmogorov-Arnold Networks (KANs), we propose a Dynamic Connection Masking (DCM) mechanism for both Multi-Layer Perceptron Networks (MLPs) and KANs to enhance the robustness of classifiers against noisy labels. The mechanism can adaptively mask less important edges during training by evaluating their information-carrying capacity. Through theoretical analysis, we demonstrate its efficiency in reducing gradient error. Our approach can be seamlessly integrated into various noise-robust training methods to build more robust deep networks, including robust loss functions, sample selection strategies, and regularization techniques. Extensive experiments on both synthetic and real-world benchmarks demonstrate that our method consistently outperforms state-of-the-art (SOTA) approaches. Furthermore, we are also the first to investigate KANs as classifiers against noisy labels, revealing their superior noise robustness over MLPs in real-world noisy scenarios. Our code will soon be publicly available.

artificial intelligence, machine learning, noisy label, (16 more...)

arXiv.org Artificial Intelligence

2508.09697

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Strategic Sample Selection for Improved Clean-Label Backdoor Attacks in Text Classification

Kirci, Onur Alp, Gursoy, M. Emre

arXiv.org Artificial IntelligenceAug-25-2025

Backdoor attacks pose a significant threat to the integrity of text classification models used in natural language proce ssing. While several dirty-label attacks that achieve high attack succe ss rates (ASR) have been proposed, clean-label attacks are inherently mor e difficult. In this paper, we propose three sample selection strategies to improve attack effectiveness in clean-label scenarios: Minimum, Above50, and Below50. Our strategies identify those samples which the model predi cts incorrectly or with low confidence, and by injecting backdoor trig gers into such samples, we aim to induce a stronger association betwee n the trigger patterns and the attacker-desired target label. We appl y our methods to clean-label variants of four canonical backdoor atta cks (Insert-Sent, WordInj, StyleBkd, SynBkd) and evaluate them on three datasets (IMDB, SST2, HateSpeech) and four model types (LSTM, BERT, D istilBERT, RoBERTa). Results show that the proposed strategi es, particularly the Minimum strategy, significantly improve the ASR o ver random sample selection with little or no degradation in the mod el's clean accuracy. Furthermore, clean-label attacks enhanced by ou r strategies outperform BITE, a state of the art clean-label attack metho d, in many configurations.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2508.15934

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Exploring Video-Based Driver Activity Recognition under Noisy Labels

Fan, Linjuan, Wen, Di, Peng, Kunyu, Yang, Kailun, Zhang, Jiaming, Liu, Ruiping, Chen, Yufan, Zheng, Junwei, Wu, Jiamin, Han, Xudong, Stiefelhagen, Rainer

arXiv.org Artificial IntelligenceAug-12-2025

As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition, as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the driver activity recognition field. In this paper, we propose the first label noise learning approach for the driver activity recognition task. Based on the cluster assumption, we initially enable the model to learn clustering-friendly low-dimensional representations from given videos and assign the resultant embeddings into clusters. We subsequently perform co-refinement within each cluster to smooth the classifier outputs. Furthermore, we propose a flexible sample selection strategy that combines two selection criteria without relying on any hyperparameters to filter clean samples from the training dataset. We also incorporate a self-adaptive parameter into the sample selection process to enforce balancing across classes. A comprehensive variety of experiments on the public Drive&Act dataset for all granularity levels demonstrates the superior performance of our method in comparison with other label-denoising methods derived from the image classification field. The source code is available at https://github.com/ilonafan/DAR-noisy-labels.

artificial intelligence, machine learning, recognition, (16 more...)

arXiv.org Artificial Intelligence

2504.11966

Country: Asia > China (0.28)

Genre: Research Report (0.64)

Industry:

Education (0.68)
Law (0.56)
Automobiles & Trucks (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.96)

Add feedback

Dependency-aware Maximum Likelihood Estimation for Active Learning

Kalkanli, Beyza, Imbiriba, Tales, Ioannidis, Stratis, Erdogmus, Deniz, Dy, Jennifer

arXiv.org Machine LearningMar-7-2025

Active learning aims to efficiently build a labeled training set by strategically selecting samples to query labels from annotators. In this sequential process, each sample acquisition influences subsequent selections, causing dependencies among samples in the labeled set. However, these dependencies are overlooked during the model parameter estimation stage when updating the model using Maximum Likelihood Estimation (MLE), a conventional method that assumes independent and identically distributed (i.i.d.) data. We propose Dependency-aware MLE (DMLE), which corrects MLE within the active learning framework by addressing sample dependencies typically neglected due to the i.i.d. assumption, ensuring consistency with active learning principles in the model parameter estimation process. This improved method achieves superior performance across multiple benchmark datasets, reaching higher performance in earlier cycles compared to conventional MLE. Specifically, we observe average accuracy improvements of 6\%, 8.6\%, and 10.5\% for $k=1$, $k=5$, and $k=10$ respectively, after collecting the first 100 samples, where entropy is the acquisition function and $k$ is the query batch size acquired at every active learning cycle.

active learning, active learning cycle, learning, (11 more...)

arXiv.org Machine Learning

2503.05969

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
(5 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Add feedback

Use Random Selection for Now: Investigation of Few-Shot Selection Strategies in LLM-based Text Augmentation for Classification

Cegin, Jan, Pecher, Branislav, Simko, Jakub, Srba, Ivan, Bielikova, Maria, Brusilovsky, Peter

arXiv.org Artificial IntelligenceOct-14-2024

The generative large language models (LLMs) are increasingly used for data augmentation tasks, where text samples are paraphrased (or generated anew) and then used for classifier fine-tuning. Existing works on augmentation leverage the few-shot scenarios, where samples are given to LLMs as part of prompts, leading to better augmentations. Yet, the samples are mostly selected randomly and a comprehensive overview of the effects of other (more ``informed'') sample selection strategies is lacking. In this work, we compare sample selection strategies existing in few-shot learning literature and investigate their effects in LLM-based textual augmentation. We evaluate this on in-distribution and out-of-distribution classifier performance. Results indicate, that while some ``informed'' selection strategies increase the performance of models, especially for out-of-distribution data, it happens only seldom and with marginal performance increases. Unless further advances are made, a default of random sample selection remains a good option for augmentation practitioners.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2410.10756

Country:

Asia > Singapore (0.05)
North America > Canada > Ontario > Toronto (0.04)
Europe > Czechia > South Moravian Region > Brno (0.04)
(7 more...)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.67)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Deep Learning-Based Channel Squeeze U-Structure for Lung Nodule Detection and Segmentation

Sui, Mingxiu, Hu, Jiacheng, Zhou, Tong, Liu, Zibo, Wen, Likang, Du, Junliang

arXiv.org Artificial IntelligenceSep-20-2024

This paper introduces a novel deep-learning method for the automatic detection and segmentation of lung nodules, aimed at advancing the accuracy of early-stage lung cancer diagnosis. The proposed approach leverages a unique "Channel Squeeze U-Structure" that optimizes feature extraction and information integration across multiple semantic levels of the network. This architecture includes three key modules: shallow information processing, channel residual structure, and channel squeeze integration. These modules enhance the model's ability to detect and segment small, imperceptible, or ground-glass nodules, which are critical for early diagnosis. The method demonstrates superior performance in terms of sensitivity, Dice similarity coefficient, precision, and mean Intersection over Union (IoU). Extensive experiments were conducted on the Lung Image Database Consortium (LIDC) dataset using five-fold cross-validation, showing excellent stability and robustness. The results indicate that this approach holds significant potential for improving computer-aided diagnosis systems, providing reliable support for radiologists in clinical practice and aiding in the early detection of lung cancer, especially in resource-limited settings

application, nodule, segmentation, (12 more...)

arXiv.org Artificial Intelligence

2409.13868

Country:

North America > United States > Iowa (0.04)
Asia > China > Shanghai > Shanghai (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Oncology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Parameter-Efficient Active Learning for Foundational models

Narayanan, Athmanarayanan Lakshmi, Krishnan, Ranganath, Machireddy, Amrutha, Subedar, Mahesh

arXiv.org Artificial IntelligenceJun-14-2024

Foundational vision transformer models have shown impressive few shot performance on many vision tasks. This research presents a novel investigation into the application of parameter efficient fine-tuning methods within an active learning (AL) framework, to advance the sampling selection process in extremely budget constrained classification tasks. The focus on image datasets, known for their out-of-distribution characteristics, adds a layer of complexity and relevance to our study. Through a detailed evaluation, we illustrate the improved AL performance on these challenging datasets, highlighting the strategic advantage of merging parameter efficient fine tuning methods with foundation models. This contributes to the broader discourse on optimizing AL strategies, presenting a promising avenue for future exploration in leveraging foundation models for efficient and effective data annotation in specialized domains.

active learning, foundation model, learning, (12 more...)

arXiv.org Artificial Intelligence

2406.09296

Country:

North America > United States > Oregon > Washington County > Hillsboro (0.05)
North America > United States > California > Santa Clara County > Santa Clara (0.04)
Asia > India > Karnataka > Bengaluru (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Batch-in-Batch: a new adversarial training framework for initial perturbation and sample selection

Wu, Yinting, Peng, Pai, Cai, Bo, Li, Le, ., null

arXiv.org Artificial IntelligenceJun-6-2024

Adversarial training methods commonly generate independent initial perturbation for adversarial samples from a simple uniform distribution, and obtain the training batch for the classifier without selection. In this work, we propose a simple yet effective training framework called Batch-in-Batch (BB) to enhance models robustness. It involves specifically a joint construction of initial values that could simultaneously generates $m$ sets of perturbations from the original batch set to provide more diversity for adversarial samples; and also includes various sample selection strategies that enable the trained models to have smoother losses and avoid overconfident outputs. Through extensive experiments on three benchmark datasets (CIFAR-10, SVHN, CIFAR-100) with two networks (PreActResNet18 and WideResNet28-10) that are used in both the single-step (Noise-Fast Gradient Sign Method, N-FGSM) and multi-step (Projected Gradient Descent, PGD-10) adversarial training, we show that models trained within the BB framework consistently have higher adversarial accuracy across various adversarial settings, notably achieving over a 13% improvement on the SVHN dataset with an attack radius of 8/255 compared to the N-FGSM baseline model. Furthermore, experimental analysis of the efficiency of both the proposed initial perturbation method and sample selection strategies validates our insights. Finally, we show that our framework is cost-effective in terms of computational resources, even with a relatively large value of $m$.

accuracy, adversarial sample, initial perturbation, (13 more...)

arXiv.org Artificial Intelligence

2406.0407

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > China > Hubei Province > Wuhan (0.04)
North America > United States (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Automatic Combination of Sample Selection Strategies for Few-Shot Learning

Pecher, Branislav, Srba, Ivan, Bielikova, Maria, Vanschoren, Joaquin

arXiv.org Artificial IntelligenceFeb-5-2024

In few-shot learning, such as meta-learning, few-shot fine-tuning or in-context learning, the limited number of samples used to train a model have a significant impact on the overall success. Although a large number of sample selection strategies exist, their impact on the performance of few-shot learning is not extensively known, as most of them have been so far evaluated in typical supervised settings only. In this paper, we thoroughly investigate the impact of 20 sample selection strategies on the performance of 5 few-shot learning approaches over 8 image and 6 text datasets. In addition, we propose a new method for automatic combination of sample selection strategies (ACSESS) that leverages the strengths and complementary information of the individual strategies. The experimental results show that our method consistently outperforms the individual selection strategies, as well as the recently proposed method for selecting support examples for in-context learning. We also show a strong modality, dataset and approach dependence for the majority of strategies as well as their dependence on the number of shots - demonstrating that the sample selection strategies play a significant role for lower number of shots, but regresses to random selection at higher number of shots.

learning, selection, selection strategy, (12 more...)

arXiv.org Artificial Intelligence

2402.03038

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Canada > Ontario > Toronto (0.04)
North America > Dominican Republic (0.04)
(14 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback