Instance Selection
Quantum Annealing for Machine Learning: Applications in Feature Selection, Instance Selection, and Clustering
Pomeroy, Chloe, Pramov, Aleksandar, Thakrar, Karishma, Yendapalli, Lakshmi
This paper explores the applications of quantum annealing (QA) and classical simulated annealing (SA) to a suite of combinatorial optimization problems in machine learning, namely feature selection, instance selection, and clustering. We formulate each task as a Quadratic Unconstrained Binary Optimization (QUBO) problem and implement both quantum and classical solvers to compare their effectiveness. For feature selection, we propose several QUBO configurations that balance feature importance and redundancy, showing that QA produces solutions with greater computational efficiency. In instance selection, we propose several novel heuristics for instance-level importance measures that extend existing methods. For clustering, we implement a classical-to-quantum pipeline, using classical clustering followed by QUBO-based medoid refinement, and demonstrate consistent improvements in cluster compactness and retrieval metrics. Our results suggest that QA can be a competitive and efficient tool for discrete machine learning optimization, even within the constraints of current quantum hardware.
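The QUBO formulation for feature selection described above can be illustrated with a minimal sketch: diagonal terms reward per-feature importance, off-diagonal terms penalize pairwise redundancy, and a solver minimizes the resulting quadratic form. The helper names, toy scores, and trade-off weight `alpha` below are illustrative assumptions, not the paper's exact configuration; exhaustive enumeration stands in for an annealer.

```python
import itertools
import numpy as np

def build_qubo(importance, redundancy, alpha=1.0):
    """Diagonal rewards feature importance (negative, since we minimize);
    off-diagonal penalizes pairwise redundancy."""
    Q = alpha * np.array(redundancy, dtype=float)
    np.fill_diagonal(Q, -np.asarray(importance, dtype=float))
    return Q

def solve_qubo_exact(Q):
    """Exhaustive minimizer of x^T Q x over binary x (fine for small n;
    a quantum or simulated annealer would replace this step)."""
    n = Q.shape[0]
    best_x, best_e = None, float("inf")
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits)
        e = x @ Q @ x
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

importance = [0.9, 0.8, 0.1, 0.05]          # e.g. mutual information with target
redundancy = [[0, 0.7, 0, 0],               # features 0 and 1 are highly correlated
              [0.7, 0, 0, 0],
              [0, 0, 0, 0.1],
              [0, 0, 0.1, 0]]
Q = build_qubo(importance, redundancy)
x, e = solve_qubo_exact(Q)
print(x)   # → [1 0 1 0]
```

For these toy scores the minimizer keeps features 0 and 2: feature 1 is strong but its redundancy with feature 0 outweighs its importance, and feature 3 is too weak to pay for its overlap with feature 2.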
- North America > United States > Georgia > Fulton County > Atlanta (0.14)
- Europe > Spain > Community of Madrid > Madrid (0.04)
- Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
- Information Technology > Data Science > Data Quality > Instance Selection (0.82)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Scalable Graph Attention-based Instance Selection via Mini-Batch Sampling and Hierarchical Hashing
Rustamov, Zahiriddin, Zaitouny, Ayham, Zaki, Nazar
Instance selection (IS) is important in machine learning for reducing dataset size while keeping key characteristics. Current IS methods often struggle to capture complex relationships in high-dimensional spaces and to scale to large datasets. This paper introduces a graph attention-based instance selection (GAIS) method that uses attention mechanisms to identify informative instances through their structural relationships in graph representations. We present two approaches for scalable graph construction: a distance-based mini-batch sampling technique that reduces computation through strategic batch processing, and a hierarchical hashing approach that allows for efficient similarity computation through random projections. The mini-batch approach preserves class distributions through stratified sampling, while the hierarchical hashing method captures relationships at multiple granularities through single-level, multi-level, and multi-view variants. Experiments across 39 datasets show that GAIS achieves reduction rates above 96\% while maintaining or improving model performance relative to state-of-the-art IS methods. The findings show that the distance-based mini-batch approach offers an optimal balance of efficiency and effectiveness for large-scale datasets, while the multi-view variants provide superior performance for complex, high-dimensional data, demonstrating that attention-based importance scoring can effectively identify instances crucial for maintaining decision boundaries without requiring exhaustive pairwise comparisons.
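The hierarchical hashing side of the method rests on random projections. The following is a minimal sketch of that single building block (single-level, sign-of-projection hashing), not the GAIS pipeline itself; `random_projection_hash` and the toy clusters are illustrative assumptions.

```python
import numpy as np

def random_projection_hash(X, n_bits, rng):
    """Sign-of-projection hashing: nearby points tend to share bucket codes,
    so pairwise similarity need only be computed within buckets."""
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = (X @ planes) >= 0
    # pack each row of sign bits into an integer bucket id
    return bits.astype(np.int64) @ (1 << np.arange(n_bits))

rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=-3.0, scale=0.05, size=(50, 8))
cluster_b = rng.normal(loc=5.0, scale=0.05, size=(50, 8))
X = np.vstack([cluster_a, cluster_b])
codes = random_projection_hash(X, n_bits=8, rng=rng)
# the two tight, well-separated clusters land in disjoint buckets
print(len(set(codes.tolist())))
```

Stacking several such codes at different bit widths gives the multi-level variant; computing them over different feature subsets gives a multi-view flavor.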
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Data Science > Data Quality > Instance Selection (0.82)
Meta-Instance Selection. Instance Selection as a Classification Problem with Meta-Features
Blachnik, Marcin, Ciepliński, Piotr
Data pruning, or instance selection, is an important problem in machine learning, especially for the nearest neighbour classifier. However, while data pruning speeds up the prediction phase, the speed and efficiency of the pruning process itself are an issue. In response, the study proposes transforming the instance selection process into a classification task conducted in a unified meta-feature space, where each instance can be classified and assigned to either the "to keep" or "to remove" class. This approach requires training an appropriate meta-classifier, which can be developed from historical instance selection results on other datasets, using reference instance selection methods as a labeling tool. This work proposes constructing the meta-feature space from properties extracted from the nearest neighbor graph. Experiments conducted on 17 datasets of varying sizes and five reference instance selection methods (ENN, Drop3, ICF, HMN-EI, and CCIS) demonstrate that the proposed solution achieves results comparable to reference instance selection methods while significantly reducing computational complexity. In the proposed approach, the computational complexity of the system depends only on identifying the k nearest neighbors of each data sample and running the meta-classifier. Additionally, scaling laws translate into a requirement for huge compute power during both training and prediction, which is not always available in real-life scenarios where compute resources are limited. In that case, both the training data and the prediction model should require small computing resources. Therefore, the training set should be as small as possible while preserving the prediction accuracy obtained with the original training set. This issue is not new; its development started primarily for the nearest neighbor classifier under the name of instance selection.
Thus, already in the late 1960s and early 1970s, algorithms such as Condensed Nearest Neighbor (CNN), Edited Nearest Neighbor (ENN), and many others were developed. Benchmarks of instance selection indicate the Drop3 [3] and ICF [4] algorithms as the most widely used, which, despite not being new, are characterized by excellent properties in terms of the balance between the prediction accuracy of the kNN algorithm and the reduction of the size of the stored set of prototypes (the reduction rate) [5]. These algorithms are applicable not only as elements of the learning process for the kNN algorithm (prototype selection as part of the learning process) but also as universal algorithms for reducing the size of the training set for any classifier, thereby accelerating the learning process of complex predictive models, the search for optimal parameters, etc. Examples of such applications can be found in [6, 7] or in [5].
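The nearest-neighbor-graph meta-features described above can be illustrated with one plausible minimal example: the fraction of an instance's k nearest neighbours that share its label. The sketch below computes it and applies a simple ENN-style threshold in place of a trained meta-classifier; `knn_meta_features`, the threshold, and the toy data are illustrative assumptions, not the paper's feature set.

```python
import numpy as np

def knn_meta_features(X, y, k=3):
    """One simple meta-feature per instance: the fraction of its k nearest
    neighbours that share its label (an ENN-style agreement score)."""
    X = np.asarray(X, float)
    y = np.asarray(y)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)           # exclude self-distance
    nn = np.argsort(d, axis=1)[:, :k]
    return (y[nn] == y[:, None]).mean(axis=1)

# tiny 1-D example: two clean clusters and one mislabelled point
X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [0.15]]
y = [0, 0, 0, 1, 1, 1, 1]    # last point sits in cluster 0 but is labelled 1
agreement = knn_meta_features(X, y, k=3)
keep = agreement >= 0.5      # threshold rule a trained meta-classifier would replace
print(keep)
```

The mislabelled point disagrees with all of its neighbours and is assigned to the "to remove" class; all clean points are kept.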
- North America > United States (0.04)
- Europe > Poland (0.04)
- Europe > France (0.04)
- Asia (0.04)
- Information Technology > Data Science > Data Quality > Instance Selection (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (1.00)
GAIS: A Novel Approach to Instance Selection with Graph Attention Networks
Rustamov, Zahiriddin, Zaitouny, Ayham, Damseh, Rafat, Zaki, Nazar
Instance selection (IS) is a crucial technique in machine learning that aims to reduce dataset size while maintaining model performance. This paper introduces a novel method called Graph Attention-based Instance Selection (GAIS), which leverages Graph Attention Networks (GATs) to identify the most informative instances in a dataset. GAIS represents the data as a graph and uses GATs to learn node representations, enabling it to capture complex relationships between instances. The method processes data in chunks, applies random masking and similarity thresholding during graph construction, and selects instances based on confidence scores from the trained GAT model. Experiments on 13 diverse datasets demonstrate that GAIS consistently outperforms traditional IS methods in terms of effectiveness, achieving high reduction rates (average 96\%) while maintaining or improving model performance. Although GAIS exhibits slightly higher computational costs, its superior performance in maintaining accuracy with significantly reduced training data makes it a promising approach for graph-based data selection.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
- Asia > China (0.04)
- Health & Medicine > Therapeutic Area > Oncology (0.68)
- Education (0.68)
Instance Selection for GANs
Recent advances in Generative Adversarial Networks (GANs) have led to their widespread adoption for generating high-quality synthetic imagery. While capable of generating photo-realistic images, these models often produce unrealistic samples which fall outside of the data manifold. Several recently proposed techniques attempt to avoid spurious samples, either by rejecting them after generation or by truncating the model's latent space. While effective, these methods are inefficient, as a large fraction of training time and model capacity is dedicated to samples that will ultimately go unused. In this work we propose a novel approach to improve sample quality: altering the training dataset via instance selection before model training has taken place.
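The idea of pruning off-manifold training samples before GAN training can be sketched with a simple density score in an embedding space, here a single Gaussian fit with a Mahalanobis-distance cutoff. The embedding, scoring function, and helper names are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def density_keep_mask(Z, keep_frac=0.9):
    """Score each embedded sample under a single Gaussian fit to all
    embeddings and keep the densest keep_frac; low-density outliers are
    dropped before (hypothetical) GAN training."""
    mu = Z.mean(axis=0)
    cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])
    inv = np.linalg.inv(cov)
    d = Z - mu
    # squared Mahalanobis distance; smaller = denser region
    m2 = np.einsum("ij,jk,ik->i", d, inv, d)
    cutoff = np.quantile(m2, keep_frac)
    return m2 <= cutoff

rng = np.random.default_rng(1)
inliers = rng.normal(0, 1, size=(98, 4))
outliers = rng.normal(0, 1, size=(2, 4)) + 12.0   # far off the data manifold
Z = np.vstack([inliers, outliers])
mask = density_keep_mask(Z, keep_frac=0.9)
print(mask[-2:])   # → [False False]
```

The two planted outliers fall well outside the fitted density and are excluded, along with the most extreme inliers, before any model capacity is spent on them.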
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Quality > Instance Selection (0.66)
Data as voters: instance selection using approval-based multi-winner voting
Sánchez-Fernández, Luis, Fisteus, Jesús A., López-Zaragoza, Rafael
Instance selection (or prototype selection) [García et al.(2015)] is a preprocessing task in machine learning (or data mining) that aims at selecting a subset of the instances composing the training set that a machine learning algorithm will use. There are two main reasons to perform this task: efficiency and cleaning. Reducing the size of the training set reduces the computational cost of running the machine learning algorithm, especially in the case of instance-based classifiers like KNN (see the Preliminaries section for a description of KNN classifiers). Furthermore, we may be interested in removing noisy instances from the training set: instances arising from errors or other causes can induce mistakes in the machine learning algorithm.
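The "data as voters" intuition can be sketched with one simple approval rule, assuming each instance approves its k nearest same-class neighbours and instances are then ranked by approvals received. The paper's actual approval-based multi-winner voting rules are richer; `approval_votes` and the toy data below are illustrative.

```python
import numpy as np

def approval_votes(X, y, k=2):
    """Each instance votes for (approves) its k nearest same-class
    neighbours; well-supported instances collect many approvals."""
    X = np.asarray(X, float)
    y = np.asarray(y)
    n = len(y)
    votes = np.zeros(n, dtype=int)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        approved = same[np.argsort(d[i, same])][:k]
        votes[approved] += 1
    return votes

X = [[0.0], [0.1], [0.2], [3.0], [5.0], [5.1], [5.2]]
y = [0, 0, 0, 0, 1, 1, 1]      # index 3 is an isolated class-0 point
votes = approval_votes(X, y, k=2)
print(votes)   # → [2 3 3 0 2 2 2]
```

The isolated point receives no approvals, so a multi-winner rule selecting the most-approved instances would drop it, serving the cleaning goal described above.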
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Spain > Community of Madrid > Madrid (0.04)
- Information Technology > Data Science > Data Quality > Instance Selection (0.62)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.30)
Instance Selection Mechanisms for Human-in-the-Loop Systems in Few-Shot Learning
Jakubik, Johannes, Blumenstiel, Benedikt, Vössing, Michael, Hemmer, Patrick
Business analytics and machine learning have become essential success factors for various industries - with the downside of cost-intensive gathering and labeling of data. Few-shot learning addresses this challenge and reduces data gathering and labeling costs by learning novel classes with very few labeled data. In this paper, we design a human-in-the-loop (HITL) system for few-shot learning and analyze an extensive range of mechanisms that can be used to acquire human expert knowledge for instances that have an uncertain prediction outcome. We show that the acquisition of human expert knowledge significantly improves few-shot model performance at negligible labeling effort. We validate our findings in various experiments on a benchmark dataset in computer vision and on real-world datasets. We further demonstrate the cost-effectiveness of HITL systems for few-shot learning. Overall, our work aims at supporting researchers and practitioners in effectively adapting machine learning models to novel classes at reduced costs.
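One standard mechanism for routing uncertainly predicted instances to a human expert is least-confidence sampling. The sketch below is a generic illustration of that family of acquisition mechanisms under a fixed labeling budget, not necessarily one of the specific mechanisms the paper analyzes.

```python
import numpy as np

def least_confident(probs, budget):
    """Pick the `budget` instances whose top predicted class probability is
    lowest -- i.e. the instances the model is least sure about."""
    probs = np.asarray(probs, float)
    confidence = probs.max(axis=1)
    return np.argsort(confidence)[:budget]

probs = [[0.98, 0.02],    # confident prediction
         [0.55, 0.45],    # uncertain -> ask the human
         [0.90, 0.10],
         [0.51, 0.49]]    # most uncertain
queried = least_confident(probs, budget=2)
print(sorted(queried.tolist()))   # → [1, 3]
```

The human expert then labels only the queried instances, keeping the labeling effort negligible relative to the model improvement.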
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
- Information Technology > Data Science > Data Quality > Instance Selection (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Instance Selection Improves Geometric Mean Accuracy: A Study on Imbalanced Data Classification
Kuncheva, Ludmila I., Arnaiz-González, Álvar, Díez-Pastor, José-Francisco, Gunn, Iain A. D.
A natural way of handling imbalanced data is to attempt to equalise the class frequencies and train the classifier of choice on balanced data. For two-class imbalanced problems, the classification success is typically measured by the geometric mean (GM) of the true positive and true negative rates. Here we prove that GM can be improved upon by instance selection, and give the theoretical conditions for such an improvement. We demonstrate that GM is non-monotonic with respect to the number of retained instances, which discourages systematic instance selection. We also show that balancing the distribution frequencies is inferior to a direct maximisation of GM. To verify our theoretical findings, we carried out an experimental study of 12 instance selection methods for imbalanced data, using 66 standard benchmark data sets. The results reveal possible room for new instance selection methods for imbalanced data.
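The geometric mean score discussed above is straightforward to compute from confusion-matrix counts; the toy numbers below (my own, for illustration) also show why a trivial majority-class classifier scores zero on imbalanced data.

```python
import math

def geometric_mean(tp, fn, tn, fp):
    """GM of true positive rate and true negative rate, the standard
    two-class imbalanced-data score."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return math.sqrt(tpr * tnr)

# 1000 negatives, 50 positives: predicting 'negative' everywhere gives
# perfect TNR but zero TPR, so GM collapses to 0
print(geometric_mean(tp=0, fn=50, tn=1000, fp=0))            # → 0.0
# a classifier catching 40/50 positives at 900/1000 negatives
print(round(geometric_mean(tp=40, fn=10, tn=900, fp=100), 3))  # → 0.849
```

Because GM multiplies the two rates, an instance selection step that trades a little TNR for a large TPR gain can raise GM, which is the kind of improvement the paper characterizes theoretically.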
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Wisconsin (0.04)
- North America > United States > Texas (0.04)
- (5 more...)
- Information Technology > Data Science > Data Quality > Instance Selection (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)
Instance Selection and Instance Weighting for Cross-Domain Sentiment Classification via PU Learning
Xia, Rui (Nanjing University of Science and Technology) | Hu, Xuelei (Nanjing University of Science and Technology) | Lu, Jianfeng (Nanjing University of Science and Technology) | Yang, Jian (Nanjing University of Science and Technology) | Zong, Chengqing (National Laboratory of Pattern Recognition, Institute of Automation)
Due to the explosive growth of online reviews on the Internet, we can easily collect a large amount of labeled reviews from different domains. But only some of them are beneficial for training a desired target-domain sentiment classifier. Therefore, it is important to identify those samples that are the most relevant to the target domain and use them as training data. To address this problem, a novel approach based on instance selection and instance weighting via PU learning is proposed. PU learning is used first to learn an in-target-domain selector, which assigns an in-target-domain probability to each sample in the training set. For instance selection, the samples with higher in-target-domain probability are used as training data; for instance weighting, the calibrated in-target-domain probabilities are used as sampling weights for training an instance-weighted naive Bayes model, based on the principle of maximum weighted likelihood estimation. The experimental results demonstrate the necessity and effectiveness of the approach, especially when the size of the training data is large. It is also shown that the larger the Kullback-Leibler divergence between the training and test data, the more effective the proposed approach.
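The maximum weighted likelihood idea can be sketched for a Bernoulli naive Bayes model: every count in the usual estimates is replaced by a weighted count, with the weights playing the role of the calibrated in-target-domain probabilities. The helper and toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def weighted_nb_fit(X, y, w, alpha=1.0):
    """Bernoulli naive Bayes with per-instance weights: each count in the
    usual MLE becomes a weighted count (maximum weighted likelihood)."""
    X, y, w = np.asarray(X, float), np.asarray(y), np.asarray(w, float)
    priors, cond = {}, {}
    for c in np.unique(y):
        wc = w[y == c]
        priors[c] = wc.sum() / w.sum()
        # weighted, Laplace-smoothed estimate of P(feature = 1 | class)
        cond[c] = (wc @ X[y == c] + alpha) / (wc.sum() + 2 * alpha)
    return priors, cond

X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
w = [2.0, 1.0, 1.0, 1.0]    # e.g. calibrated in-target-domain probabilities
priors, cond = weighted_nb_fit(X, y, w)
print(priors[1], cond[1])   # → 0.6 [0.8 0.4]
```

Setting all weights to 1 recovers ordinary naive Bayes, so the weighting degrades gracefully when the domains already match.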
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Quality > Instance Selection (0.80)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.69)
Entity Linking with Effective Acronym Expansion, Instance Selection and Topic Modeling
Zhang, Wei (National University of Singapore) | Sim, Yan-Chuan (Institute for Infocomm Research) | Su, Jian (Institute for Infocomm Research) | Tan, Chew-Lim (National University of Singapore)
Entity linking maps name mentions in documents to entries in a knowledge base by resolving name variations and ambiguities. In this paper, we propose three advancements for entity linking. Firstly, expanding acronyms can effectively reduce the ambiguity of acronym mentions. However, only rule-based approaches relying heavily on the presence of text markers have been used for entity linking. In this paper, we propose a supervised learning algorithm to expand the more complicated acronyms encountered, which leads to a 15.1% accuracy improvement over state-of-the-art acronym expansion methods. Secondly, as entity linking annotation is expensive and labor intensive, to automate the annotation process without compromising accuracy, we propose an instance selection strategy to effectively utilize the automatically generated annotation. In our selection strategy, an informative and diverse set of instances is selected for effective disambiguation. Lastly, topic modeling is used to model the semantic topics of the articles. Each of these advancements gives a statistically significant improvement to entity linking individually; collectively, they lead to the highest performance on the KBP-2010 task.
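An "informative and diverse" selection can be sketched with a greedy MMR-style rule that trades an informativeness score against similarity to instances already chosen. This is a generic illustration of the informativeness/diversity trade-off, not necessarily the authors' exact strategy; the scores, similarity matrix, and `lam` weight are made up for the example.

```python
import numpy as np

def informative_diverse(scores, sim, budget, lam=0.5):
    """Greedy MMR-style pick: trade informativeness against similarity to
    already-selected instances, so the chosen set stays diverse."""
    scores = np.asarray(scores, float)
    selected = []
    candidates = list(range(len(scores)))
    while candidates and len(selected) < budget:
        def gain(i):
            # redundancy = similarity to the closest already-selected item
            red = max(sim[i][j] for j in selected) if selected else 0.0
            return lam * scores[i] - (1 - lam) * red
        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected

scores = [0.9, 0.85, 0.2]
sim = [[1.0, 0.95, 0.1],     # items 0 and 1 are near-duplicates
       [0.95, 1.0, 0.1],
       [0.1, 0.1, 1.0]]
print(informative_diverse(scores, sim, budget=2))   # → [0, 2]
```

The near-duplicate of the first pick is skipped in favor of a less informative but novel instance, which is the behavior a diversity-aware selection strategy is after.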
- Asia > China (0.15)
- North America > United States > Texas (0.04)
- North America > United States > New York (0.04)
- (4 more...)
- Government > Regional Government (0.68)
- Health & Medicine (0.47)