Clustering
MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels
Ploshchik, Ilya, Chatzimparmpas, Angelos, Kerren, Andreas
Stacking (or stacked generalization) is an ensemble learning method with one main distinctiveness from the rest: even though several base models are trained on the original data set, their predictions are further used as input data for one or more metamodels arranged in at least one extra layer. Composing a stack of models can produce high-performance outcomes, but it usually involves a trial-and-error process. Therefore, our previously developed visual analytics system, StackGenVis, was mainly designed to assist users in choosing a set of top-performing and diverse models by measuring their predictive performance. However, it only employs a single logistic regression metamodel. In this paper, we investigate the impact of alternative metamodels on the performance of stacking ensembles using a novel visualization tool, called MetaStackVis. Our interactive tool helps users to visually explore different singular and pairs of metamodels according to their predictive probabilities and multiple validation metrics, as well as their ability to predict specific problematic data instances. MetaStackVis was evaluated with a usage scenario based on a medical data set and via expert interviews.
Hard Sample Aware Network for Contrastive Deep Graph Clustering
Liu, Yue, Yang, Xihong, Zhou, Sihang, Liu, Xinwang, Wang, Zhen, Liang, Ke, Tu, Wenxuan, Li, Liang, Duan, Jingcan, Chen, Cancan
Contrastive deep graph clustering, which aims to divide nodes into disjoint groups via contrastive mechanisms, is a challenging research spot. Among the recent works, hard sample mining-based algorithms have achieved great attention for their promising performance. However, we find that the existing hard sample mining methods have two problems as follows. 1) In the hardness measurement, the important structural information is overlooked for similarity calculation, degrading the representativeness of the selected hard negative samples. 2) Previous works merely focus on the hard negative sample pairs while neglecting the hard positive sample pairs. Nevertheless, samples within the same cluster but with low similarity should also be carefully learned. To solve the problems, we propose a novel contrastive deep graph clustering method dubbed Hard Sample Aware Network (HSAN) by introducing a comprehensive similarity measure criterion and a general dynamic sample weighing strategy. Concretely, in our algorithm, the similarities between samples are calculated by considering both the attribute embeddings and the structure embeddings, better revealing sample relationships and assisting hardness measurement. Moreover, under the guidance of the carefully collected high-confidence clustering information, our proposed weight modulating function will first recognize the positive and negative samples and then dynamically up-weight the hard sample pairs while down-weighting the easy ones. In this way, our method can mine not only the hard negative samples but also the hard positive sample, thus improving the discriminative capability of the samples further. Extensive experiments and analyses demonstrate the superiority and effectiveness of our proposed method.
Neural Gas Network Image Features and Segmentation for Brain Tumor Detection Using Magnetic Resonance Imaging Data
Accurate detection of brain tumors could save lots of lives and increasing the accuracy of this binary classification even as much as a few percent has high importance. Neural Gas Networks (NGN) is a fast, unsupervised algorithm that could be used in data clustering, image pattern recognition, and image segmentation. In this research, we used the metaheuristic Firefly Algorithm (FA) for image contrast enhancement as pre-processing and NGN weights for feature extraction and segmentation of Magnetic Resonance Imaging (MRI) data on two brain tumor datasets from the Kaggle platform. Also, tumor classification is conducted by Support Vector Machine (SVM) classification algorithms and compared with a deep learning technique plus other features in train and test phases. Additionally, NGN tumor segmentation is evaluated by famous performance metrics such as Accuracy, F-measure, Jaccard, and more versus ground truth data and compared with traditional segmentation techniques. The proposed method is fast and precise in both tasks of tumor classification and segmentation compared with other methods. A classification accuracy of 95.14 % and segmentation accuracy of 0.977 is achieved by the proposed method.
Hierarchical clustering: visualization, feature importance and model selection
Cabezas, Luben M. C., Izbicki, Rafael, Stern, Rafael B.
We propose methods for the analysis of hierarchical clustering that fully use the multi-resolution structure provided by a dendrogram. Specifically, we propose a loss for choosing between clustering methods, a feature importance score and a graphical tool for visualizing the segmentation of features in a dendrogram. Current approaches to these tasks lead to loss of information since they require the user to generate a single partition of the instances by cutting the dendrogram at a specified level. Our proposed methods, instead, use the full structure of the dendrogram. The key insight behind the proposed methods is to view a dendrogram as a phylogeny. This analogy permits the assignment of a feature value to each internal node of a tree through an evolutionary model. Real and simulated datasets provide evidence that our proposed framework has desirable outcomes and gives more insights than state-of-art approaches. We provide an R package that implements our methods.
A Closer Look at Novel Class Discovery from the Labeled Set
Li, Ziyun, Otholt, Jona, Dai, Ben, hu, Di, Meinel, Christoph, Yang, Haojin
Novel class discovery (NCD) aims to infer novel categories in an unlabeled dataset leveraging prior knowledge of a labeled set comprising disjoint but related classes. Existing research focuses primarily on utilizing the labeled set at the methodological level, with less emphasis on the analysis of the labeled set itself. Thus, in this paper, we rethink novel class discovery from the labeled set and focus on two core questions: (i) Given a specific unlabeled set, what kind of labeled set can best support novel class discovery? (ii) A fundamental premise of NCD is that the labeled set must be related to the unlabeled set, but how can we measure this relation? For (i), we propose and substantiate the hypothesis that NCD could benefit more from a labeled set with a large degree of semantic similarity to the unlabeled set. Specifically, we establish an extensive and large-scale benchmark with varying degrees of semantic similarity between labeled/unlabeled datasets on ImageNet by leveraging its hierarchical class structure. As a sharp contrast, the existing NCD benchmarks are developed based on labeled sets with different number of categories and images, and completely ignore the semantic relation. For (ii), we introduce a mathematical definition for quantifying the semantic similarity between labeled and unlabeled sets. In addition, we use this metric to confirm the validity of our proposed benchmark and demonstrate that it highly correlates with NCD performance. Furthermore, without quantitative analysis, previous works commonly believe that label information is always beneficial. However, counterintuitively, our experimental results show that using labels may lead to sub-optimal outcomes in low-similarity settings.
Cluster-Based Control of Transition-Independent MDPs
Fiscko, Carmel, Kar, Soummya, Sinopoli, Bruno
This work studies efficient solution methods for cluster-based control policies of transition-independent Markov decision processes (TI-MDPs). We focus on control of multi-agent systems, whereby a central planner (CP) influences agents to select desirable group behavior. The agents are partitioned into disjoint clusters whereby agents in the same cluster receive the same controls but agents in different clusters may receive different controls. Under mild assumptions, this process can be modeled as a TI-MDP where each factor describes the behavior of one cluster. The action space of the TI-MDP becomes exponential with respect to the number of clusters. To efficiently find a policy in this rapidly scaling space, we propose a clustered Bellman operator that optimizes over the action space for one cluster at any evaluation. We present Clustered Value Iteration (CVI), which uses this operator to iteratively perform "round robin" optimization across the clusters. CVI converges exponentially faster than standard value iteration (VI), and can find policies that closely approximate the MDP's true optimal value. A special class of TI-MDPs with separable reward functions are investigated, and it is shown that CVI will find optimal policies on this class of problems. Finally, the optimal clustering assignment problem is explored. The value functions TI-MDPs with submodular reward functions are shown to be submodular functions, so submodular set optimization may be used to find a near optimal clustering assignment. We propose an iterative greedy cluster splitting algorithm, which yields monotonic submodular improvement in value at each iteration. Finally, simulations offer empirical assessment of the proposed methods.
AlignGraph: A Group of Generative Models for Graphs
Shayestehfard, Kimia, Brooks, Dana, Ioannidis, Stratis
This is a problem because state-of-the-art generative models, like the ones listed above, rely on latent It is challenging for generative models to learn a distribution node embeddings. Such embeddings vary drastically over graphs because of the lack of permutation even under nearly isomorphic graphs [13]. In turn, this invariance: nodes may be ordered arbitrarily across can hamper the fidelity of the graph generation process graphs, and standard graph alignment is combinatorial significantly. Note that this is a much harder setting and notoriously expensive. We propose AlignGraph, a than, e.g., images or text, where inputs have a canonical group of generative models that combine fast and efficient orientation. Finding the correspondence between graph graph alignment methods with a family of deep nodes is a notoriously hard problem [14, 15, 11, 16], and generative models that are invariant to node permutations.
Cross Modal Global Local Representation Learning from Radiology Reports and X-Ray Chest Images
Hadjiyski, Nathan, Vosoughi, Ali, Wismueller, Axel
Deep learning models can be applied successfully in real-work problems; however, training most of these models requires massive data. Recent methods use language and vision, but unfortunately, they rely on datasets that are not usually publicly available. Here we pave the way for further research in the multimodal language-vision domain for radiology. In this paper, we train a representation learning method that uses local and global representations of the language and vision through an attention mechanism and based on the publicly available Indiana University Radiology Report (IU-RR) dataset. Furthermore, we use the learned representations to diagnose five lung pathologies: atelectasis, cardiomegaly, edema, pleural effusion, and consolidation. Finally, we use both supervised and zero-shot classifications to extensively analyze the performance of the representation learning on the IU-RR dataset. Average Area Under the Curve (AUC) is used to evaluate the accuracy of the classifiers for classifying the five lung pathologies. The average AUC for classifying the five lung pathologies on the IU-RR test set ranged from 0.85 to 0.87 using the different training datasets, namely CheXpert and CheXphoto. These results compare favorably to other studies using UI-RR. Extensive experiments confirm consistent results for classifying lung pathologies using the multimodal global local representations of language and vision information.
Re-embedding data to strengthen recovery guarantees of clustering
Jiang, Tao, Tan, Samuel, Vavasis, Stephen
We propose a clustering method that involves chaining four known techniques into a pipeline yielding an algorithm with stronger recovery guarantees than any of the four components separately. Given $n$ points in $\mathbb R^d$, the first component of our pipeline, which we call leapfrog distances, is reminiscent of density-based clustering, yielding an $n\times n$ distance matrix. The leapfrog distances are then translated to new embeddings using multidimensional scaling and spectral methods, two other known techniques, yielding new embeddings of the $n$ points in $\mathbb R^{d'}$, where $d'$ satisfies $d'\ll d$ in general. Finally, sum-of-norms (SON) clustering is applied to the re-embedded points. Although the fourth step (SON clustering) can in principle be replaced by any other clustering method, our focus is on provable guarantees of recovery of underlying structure. Therefore, we establish that the re-embedding improves recovery SON clustering, since SON clustering is a well-studied method that already has provable guarantees.
Improving Spectral Clustering Using Spectrum-Preserving Node Aggregation
Spectral clustering is one of the most popular clustering methods. However, the high computational cost due to the involved eigen-decomposition procedure can immediately hinder its applications in large-scale tasks. In this paper we use spectrum-preserving node reduction to accelerate eigen-decomposition and generate concise representations of data sets. Specifically, we create a small number of pseudonodes based on spectral similarity. Then, standard spectral clustering algorithm is performed on the smaller node set. Finally, each data point in the original data set is assigned to the cluster as its representative pseudo-node. The proposed framework run in nearly-linear time. Meanwhile, the clustering accuracy can be significantly improved by mining concise representations. The experimental results show dramatically improved clustering performance when compared with state-of-the-art methods.