mining method
Graph Neural Network-Driven Hierarchical Mining for Complex Imbalanced Data
Qi, Yijiashun, Lu, Quanchao, Dou, Shiyu, Sun, Xiaoxuan, Li, Muqing, Li, Yankaiqi
This study presents a hierarchical mining framework for high-dimensional imbalanced data, leveraging a depth graph model to address the inherent performance limitations of conventional approaches in handling complex, high-dimensional data distributions with imbalanced sample representations. By constructing a structured graph representation of the dataset and integrating graph neural network (GNN) embeddings, the proposed method effectively captures global interdependencies among samples. Furthermore, a hierarchical strategy is employed to enhance the characterization and extraction of minority class feature patterns, thereby facilitating precise and robust imbalanced data mining. Empirical evaluations across multiple experimental scenarios validate the efficacy of the proposed approach, demonstrating substantial improvements over traditional methods in key performance metrics, including pattern discovery count, average support, and minority class coverage. Notably, the method exhibits superior capabilities in minority-class feature extraction and pattern correlation analysis. These findings underscore the potential of depth graph models, in conjunction with hierarchical mining strategies, to significantly enhance the efficiency and accuracy of imbalanced data analysis. This research contributes a novel computational framework for high-dimensional complex data processing and lays the foundation for future extensions to dynamically evolving imbalanced data and multi-modal data applications, thereby expanding the applicability of advanced data mining methodologies to more intricate analytical domains.
Critical Example Mining for Vehicle Trajectory Prediction using Flow-based Generative Models
Precise trajectory prediction in complex driving scenarios is essential for autonomous vehicles. In practice, different driving scenarios present varying levels of difficulty for trajectory prediction models. However, most existing research focuses on the average precision of prediction results, while ignoring the underlying distribution of the input scenarios. This paper proposes a critical example mining method that utilizes a data-driven approach to estimate the rareness of the trajectories. By combining the rareness estimation of observations with whole trajectories, the proposed method effectively identifies a subset of data that is relatively hard to predict before it is fed to a specific prediction model. The experimental results show that the mined subset has higher prediction error when applied to different downstream prediction models, reaching +108.1% error (more than twice the dataset average) when mining 5% of the samples. Further analysis indicates that the mined critical examples include uncommon cases such as sudden braking and cancelled lane changes, which helps to better understand and improve the performance of prediction models.
NV-Retriever: Improving text embedding models with effective hard-negative mining
Moreira, Gabriel de Souza P., Osmulski, Radek, Xu, Mengyao, Ak, Ronay, Schifferer, Benedikt, Oldridge, Even
Text embedding models have been popular for information retrieval applications such as semantic search and Question-Answering systems based on Retrieval-Augmented Generation (RAG). Those models are typically Transformer models that are fine-tuned with contrastive learning objectives. Many papers have introduced new embedding model architectures and training approaches; however, one of the key ingredients, the process of mining negative passages, remains poorly explored or described. One of the challenging aspects of fine-tuning embedding models is the selection of high-quality hard-negative passages for contrastive learning. In this paper we propose a family of positive-aware mining methods that leverage the positive relevance score for more effective removal of false negatives. We also provide a comprehensive ablation study on hard-negative mining methods over their configurations, exploring different teacher and base models. We demonstrate the efficacy of our proposed methods by introducing the NV-Retriever-v1 model, which scores 60.9 on the MTEB Retrieval (BEIR) benchmark, 0.65 points higher than previous methods. The model placed 1st when it was published to MTEB Retrieval on July 07, 2024.
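The idea of using the positive relevance score to discard likely false negatives can be illustrated with a minimal Python sketch. The function name `mine_hard_negatives`, the ratio threshold `max_ratio=0.95`, and the `top_k` cutoff are hypothetical choices made for this illustration; the abstract does not specify the exact filtering rule or its settings.

```python
def mine_hard_negatives(pos_score, candidate_scores, max_ratio=0.95, top_k=4):
    """Pick hard negatives while dropping likely false negatives.

    Assumption for illustration: a candidate is treated as a possible
    false negative when its teacher score reaches max_ratio * pos_score.
    """
    ceiling = max_ratio * pos_score
    # Keep only candidates that score clearly below the positive passage.
    kept = [(score, idx) for idx, score in enumerate(candidate_scores)
            if score < ceiling]
    # Among those, the highest-scoring ones are the hardest negatives.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in kept[:top_k]]


# Teacher relevance scores of five candidate passages for one query.
scores = [0.91, 0.40, 0.78, 0.88, 0.10]
print(mine_hard_negatives(0.90, scores, max_ratio=0.95, top_k=2))  # [2, 1]
```

Candidates 0 and 3 score at or above 95% of the positive score, so this sketch discards them as probable false negatives and returns the next hardest candidates.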
Offline versus Online Triplet Mining based on Extreme Distances of Histopathology Patches
Sikaroudi, Milad, Ghojogh, Benyamin, Safarpoor, Amir, Karray, Fakhri, Crowley, Mark, Tizhoosh, H. R.
We analyze the effect of offline and online triplet mining for a colorectal cancer (CRC) histopathology dataset containing 100,000 patches. We consider the extremes, i.e., the farthest and nearest patches to a given anchor, in both online and offline mining. While many works focus solely on selecting the triplets online (batch-wise), we also study the effect of extreme distances and neighbor patches before training in an offline fashion. We analyze the impact of extreme cases in terms of embedding distance for offline versus online mining, including easy positive, batch semi-hard, and batch hard triplet mining, neighborhood component analysis loss, its proxy version, and distance weighted sampling. We also investigate online approaches based on extreme distance, comprehensively compare offline and online mining performance based on the data patterns, and explain offline mining as a tractable generalization of online mining with a large mini-batch size. We also discuss the relations of different colorectal tissue types in terms of extreme distances. We found that offline and online mining approaches have comparable performance for a specific architecture, such as ResNet-18 in this study. Moreover, we found that the assorted case, which includes different extreme distances, is promising, especially in the online approach.
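As a rough illustration of mining by extreme distances, the following Python sketch implements the standard batch hard rule: for each anchor, take the farthest same-class sample and the nearest other-class sample. The function name and the toy one-dimensional layout are assumptions for this example, not the authors' code.

```python
import math


def batch_hard_triplets(embeddings, labels):
    """Return (anchor, positive, negative) index triplets.

    For each anchor: positive = farthest sample with the same label,
    negative = nearest sample with a different label.
    """
    triplets = []
    for a, (xa, la) in enumerate(zip(embeddings, labels)):
        pos, neg = [], []
        for b, (xb, lb) in enumerate(zip(embeddings, labels)):
            if b == a:
                continue
            pair = (math.dist(xa, xb), b)
            (pos if lb == la else neg).append(pair)
        if pos and neg:  # skip anchors without a valid positive or negative
            triplets.append((a, max(pos)[1], min(neg)[1]))
    return triplets


# Toy batch: three patches of class 0 and two of class 1 on a line.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0), (10.0, 0.0)]
classes = [0, 0, 0, 1, 1]
print(batch_hard_triplets(points, classes))
```

Running the same function once over all embeddings before training, rather than per mini-batch, corresponds to the offline view described above, in which offline mining acts like online mining with a very large mini-batch.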
Product typicality attribute mining method based on a topic clustering ensemble - Artificial Intelligence Review
Despite the extensive application of topic models in natural language processing tasks in recent years, Chinese short-comment texts, characterised by large scale, high noise and small information points, place higher demands on the accuracy and stability of the results, which existing topic models fail to satisfy. In this paper, a product typicality attribute mining method based on a topic clustering ensemble is proposed. By introducing multiple topic models into ensemble learning, the method aims to solve the problems of semantic representation loss, clustering inefficiency and lack of interpretability in the mining of product typicality attributes from short comment texts. By effectively combining the topic clustering algorithm based on the diversity of speech, the topic clustering ensemble algorithm based on non-negative matrix factorization, and the interpretation method of product typicality attributes based on the mean-shift algorithm, an unsupervised model of product typicality attribute mining for short comment texts is constructed. As shown by the experimental results, the modelling method achieves favourable performance in topic clustering and feature selection, suggesting its advantages in product typicality attribute identification and interpretability compared with common methods.
Acceleration of Large Margin Metric Learning for Nearest Neighbor Classification Using Triplet Mining and Stratified Sampling
Poorheravi, Parisa Abdolrahim, Ghojogh, Benyamin, Gaudet, Vincent, Karray, Fakhri, Crowley, Mark
Metric learning is one of the techniques in manifold learning with the goal of finding a projection subspace for increasing and decreasing the inter- and intra-class variances, respectively. Some of the metric learning methods are based on triplet learning with anchor-positive-negative triplets. Large margin metric learning for nearest neighbor classification is one of the fundamental methods to do this. Recently, Siamese networks have been introduced with the triplet loss. Many triplet mining methods have been developed for Siamese networks; however, these techniques have not been applied on the triplets of large margin metric learning for nearest neighbor classification. In this work, inspired by the mining methods for Siamese networks, we propose several triplet mining techniques for large margin metric learning. Moreover, a hierarchical approach is proposed, for acceleration and scalability of optimization, where triplets are selected by stratified sampling in hierarchical hyper-spheres. We analyze the proposed methods on three publicly available datasets, i.e., Fisher Iris, ORL faces, and MNIST datasets.
Internet of Things and data mining: From applications to techniques and systems - Gaber - Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
The massive adoption of Internet of Things (IoT) opens a plethora of new use cases, applications, frameworks, and data processing architectures. A new ecosystem of supporting technologies is being developed in parallel with IoT to enable resource provisioning for resource-constrained devices and systems (Baktir, Ozgovde, & Ersoy, 2017; Mao, You, Zhang, Huang, & Letaief, 2017; F. Wang, Hu, Hu, Zhou, & Zhao, 2017). The core of future IoT systems will be designed by integrating mobile edge computing systems, software-defined networks, 5G, augmented reality, and data mining (including machine learning and artificial intelligence) to name a few (Baktir et al., 2017; Mao et al., 2017). Data mining is the process of discovering hidden knowledge patterns from raw data; therefore, the execution of knowledge discovery processes in IoT environments will leverage the utility of IoT systems. In essence, data mining will play a vital role in highly interactive and intelligent IoT systems.
Deep Metric Learning by Online Soft Mining and Class-Aware Attention
Wang, Xinshao, Hua, Yang, Kodirov, Elyor, Hu, Guosheng, Robertson, Neil M.
Deep metric learning aims to learn a deep embedding that can capture the semantic similarity of data points. Given the availability of massive training samples, deep metric learning is known to suffer from slow convergence due to a large fraction of trivial samples. Therefore, most existing methods resort to sample mining strategies for selecting nontrivial samples to accelerate convergence and improve performance. In this work, we identify two critical limitations of the sample mining methods, and provide solutions for both of them. First, previous mining methods assign one binary score to each sample, i.e., dropping or keeping it, so they only select a subset of relevant samples in a mini-batch. Therefore, we propose a novel sample mining method, called Online Soft Mining (OSM), which assigns one continuous score to each sample to make use of all samples in the mini-batch. OSM learns extended manifolds that preserve useful intraclass variances by focusing on more similar positives. Second, the existing methods are easily influenced by outliers, as these are generally included in the mined subset. To address this, we introduce Class-Aware Attention (CAA) that assigns little attention to abnormal data samples. Furthermore, by combining OSM and CAA, we propose a novel weighted contrastive loss to learn discriminative embeddings. Extensive experiments on two fine-grained visual categorisation datasets and two video-based person re-identification benchmarks show that our method significantly outperforms the state-of-the-art.
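The contrast between binary and continuous mining scores can be sketched in a few lines of Python. The Gaussian form of the soft score, the `sigma` bandwidth, and both function names are assumptions chosen for illustration; the abstract states only that the score is continuous and emphasises more similar positives.

```python
import math


def hard_mining_scores(distances, threshold):
    """Binary mining: keep (1.0) or drop (0.0) each positive pair."""
    return [1.0 if d < threshold else 0.0 for d in distances]


def soft_mining_scores(distances, sigma=1.0):
    """Soft mining (assumed Gaussian form): every pair gets a nonzero
    weight, and closer (more similar) positives get larger weights."""
    return [math.exp(-(d * d) / (sigma * sigma)) for d in distances]


# Embedding distances between an anchor and three positives.
pair_distances = [0.2, 0.8, 1.5]
print(hard_mining_scores(pair_distances, threshold=1.0))
print(soft_mining_scores(pair_distances, sigma=1.0))
```

With the binary rule the third pair contributes nothing to the loss, whereas the soft scores keep all three pairs in play with weights that decay smoothly with distance.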
MCA-based Rule Mining Enables Interpretable Inference in Clinical Psychiatry
Gao, Qingzhu, Gonzalez, Humberto, Ahammad, Parvez
Development of interpretable machine learning models for clinical healthcare applications has the potential to change the way we understand, treat, and ultimately cure diseases and disorders in many areas of medicine. Interpretable ML models for clinical healthcare can serve not only as sources of predictions and estimates, but also as discovery tools for clinicians and researchers to reveal new knowledge from the data. High dimensionality of patient information (e.g., phenotype, genotype, and medical history), lack of objective measurements, and the heterogeneity in patient populations often create significant challenges in developing interpretable machine learning models for clinical psychiatry in practice. In this paper we take a step towards the development of such interpretable models: first, by developing a novel categorical rule mining method based on Multivariate Correspondence Analysis (MCA) capable of handling datasets with large numbers of feature categories, and second, by applying this method to build a transdiagnostic Bayesian Rule List model to screen for neuropsychiatric disorders using the Consortium for Neuropsychiatric Phenomics dataset. We show that our method is not only at least 100 times faster than state-of-the-art rule mining techniques for datasets with 50 features, but also provides interpretability and comparable prediction accuracy across several benchmark datasets.